[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread Smalyshev
Smalyshev added a comment.

> In this case, one would not be able to distinguish this from the case where 
> two statements with two qualifiers each had been given originally


It is possible to distinguish them since claim IDs are recorded too for 
bookkeeping, so the split claim would have same IDs while different claims 
would have different IDs. I'm still not sure why this distinction is important 
though.

> My point was that an attacker could craft a single statement that makes you 
> index millions of statements.


It is easy to introduce limits if this would be of any concern. Since our data 
does not have any large numbers, limiting expansion factor by, say, 50 or so 
would not impact the system and would prevent such problems.


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread Smalyshev
Smalyshev added a comment.

**This changes the structure, and the original structure is no longer 
represented and can no longer be faithfully recovered.**

This is not correct, original structure can be recovered, though I see no 
reason why would you want to do so. Can you name one?

**a mere 40 qualifiers (20 properties, each with two values) on one forged 
statement, I could create a million statements in your index -- a possible DOS 
attack vector**

Scan of the database shows there are no entries generating more than 15 
qualifier splits (at least I couldn't find any a month ago). If it ever becomes 
a problem, we could easily institute limits, but I think vandalism is better 
handled on other levels than changing our data model to avoid vandals.  I see 
no case where 20 duplicate qualifiers would legitimately be required - 
duplicate qualifier is usually a wrong way to represent the claim, since it 
essentially claims that the same event happened in two places, two times, etc. 
- which usually means what should there be is two separate claims, as event 
happening two times is two different instances of the event. So I would claim 
even most existing duplicates look like data errors.


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread Smalyshev
Smalyshev added a comment.

**This changes the structure, and the original structure is no longer 
represented and can no longer be faithfully recovered.**

This is not correct, original structure can be recovered, though I see no 
reason why would you want to do so. Can you name one?

**a mere 40 qualifiers (20 properties, each with two values) on one forged 
statement, I could create a million statements in your index -- a possible DOS 
attack vector**

Scan of the database shows there are no entries generating more than 15 
qualifier splits (at least I couldn't find any a month ago). If it ever becomes 
a problem, we could easily institute limits, but I think vandalism is better 
handled on other levels than changing our data model to avoid vandals.  I see 
no case where 20 duplicate qualifiers would legitimately be required - 
duplicate qualifier is usually a wrong way to represent the claim, since it 
essentially claims that the same event happened in two places, two times, etc. 
- which usually means what should there be is two separate claims, as event 
happening two times is two different instances of the event. So I would claim 
even most existing duplicates look like data errors.


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread mkroetzsch
mkroetzsch added a comment.

@JanZerebecki I understand what you are saying about what "indexing" means 
here. Makes sense to me. What you are saying about my example query sounds as 
if you are planning to implement query execution manually. I hope this is not 
the case and you can just give the query to Titan to get it answered for you.

You mentioned splitting certain statements with duplicate qualifiers. This 
changes the structure, and the original structure is no longer represented and 
can no longer be faithfully recovered. I don't know if this is an issue with 
the current data (which duplicate qualifiers are actually used?) but it is an 
issue in general. It also means that with a mere 40 qualifiers (20 properties, 
each with two values) on one forged statement, I could create a million 
statements in your index -- a possible DOS attack vector?

An alternative technique for handling such cases is to use a special encoding 
for overflowing values. This is done successfully in some database 
implementations, but it may mean additional checks during query answering, and 
depending on how low in the query answering process you can do these checks, 
they may or may not add significant cost (hence it's a pity that Titan does not 
support this efficiently out of the box).


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev My point is merely that sitelinks and labels //can// be handled like 
statements. Since statements must be supported anyway, it would be sensible to 
reuse the data structures and query expressions defined for them. I don't think 
that confusion is likely, since the query language will not use the colloquial 
names as my examples. Properties of Wikidata will always be referred to by 
their Pid, whereas something like "has badge" would not have an id of this 
form. So it's not like having a reserved label "has badge" that competes with 
Wikidata property labels.

Structurally, however, statements and sitelinks can all be represented in the 
same data structure as far as querying is concerned. Maybe you would like it 
better if you viewed it as a separate, independent data structure "qualified 
triple" that we would use to represent statements, sitelinks, and labels? There 
is no conceptual mix-up between the high-level terms here, just taking 
advantage of similar structure at a low level. If you look at common query 
languages like SQL, SPARQL, Cypher, etc. then you can see that they are always 
based on a relatively small set of structural primitives that do not have a 
domain-specific meaning. You can always build UIs that use domain specific 
terms like "sitelink" and that make them appear separate, but for implementers 
and API users it is very useful if some things can be unified.


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread Smalyshev
Smalyshev added a comment.

@mkroetzsch the issue is that "//point in time//" is an official property in 
wikidata, which can be looked up and associated with property //P585//. 
However, "//has badge//" is not a property and as such can not be expressed in 
these terms. Of course, we can make a language that would have some logic - 
either database-backed or just whitelist-based - that would identify that 
"//has badge//" is a special case and create special translation for it. That 
would complicate the engine, but it's not the main issue. I wonder if it is 
really what the expectations of the users would be. After all, we display links 
and badges in different form than claims in the main wikidata UI - wouldn't the 
users of the same UI expect different handling of the sitelinks and badges in 
the query API too?

**any query that would work over statements would also be expected to work for 
sitelinks**

I'm not sure we want to make that claim, because I don't think it is true. The 
data model is different for sitelinks as opposed to claims - claims have 
values, qualifiers and references, while sitelinks have badges. In fact, 
there's very little in common between the two of them except for the fact that 
they are both attached to items. So I am concerned that claiming that would be 
rather confusing.


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev My suggestion was just about the surface appearance, not about the 
inner workings. I am saying that the following two phrases have the same 
structure:

- "Find things with a *sitelink* that *has badge* *featured*".
- "Find things with a *population* that has *point in time* *2014*".

If you look at it like this, you use "badge" like a (special) qualifier 
property and "featured" like a value. This does not mean that the query 
answering will be any less efficient than with another syntax. The query engine 
would easily parse the queries in any case, without ambiguity, and know that 
sitelinks are (possibly) stored in a different way internally. Same for labels. 
The reason I was suggesting this unification here was that it also somehow 
answers the question "what do we mean by 'indexing' this data?": any query that 
would work over statements would also be expected to work for sitelinks, even 
if different structures are used internally.


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread JanZerebecki
JanZerebecki added a comment.

I understood "indexing everything" to mean that at least support answering a 
query for it with an exact value and traversing to things it is connected to 
without iterating over anything else. (Technically for Titan a composite index 
over one key, see 
http://s3.thinkaurelius.com/docs/titan/current/indexes.html#_composite_index .) 
I.e. support finding badges "Featured Article" and also support traversing to 
the site link one is connected to and then which item that is connected to. So 
this does not imply any fancy indices (like over functions) nor even full text, 
prefix or range indices.
Does it sound OK to have this as a baseline?

If there are no other indices that means e.g. that we can answer "Find people 
whose parents are married to each other." only by looking at every human that 
is married (which so far is indexed by exact value for each of the lookups on 
its own; i.e. no combined index for married humans) and then traversing to 
their children where both parents are in the previous set of humans (which is 
AFAIK currently not helped by any index).

A question about qualifiers: currently if there are two qualifiers with the 
same property on the same statement, it will be split when importing into Titan 
like as if it were two statements, is this something that is acceptable or 
should we correct this now?


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, JanZerebecki
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread mkroetzsch
mkroetzsch added a comment.

In https://phabricator.wikimedia.org/T86278#969184, @Multichill wrote:

> I would like to turn it around. We should support indexing everything:


...

> The fact that we're not creative enough to make up queries for everything 
> doesn't mean it isn't useful.


I have to disagree with this approach of designing a (query) system. Of course 
supporting "everything" would be nice, but indexing always needs to be viewed 
in the context of a query language. Depending on the query features you 
support, "indexing" may mean completely different things. The requirement that 
"everything should be indexed" is fuzzy and vague.

For example, Wikidata Query supports regular path queries (in Wikidata, 
Kleene-star recursion is accessed with an operator called "TREE"). When you say 
that references should be indexed, do you mean that it should be possible to 
navigate through references within regular path expressions? How about 
qualifiers? I think Wikidata query only supports TREE in the main statements. 
On the other hand, Wikidata query does not support any kind of cyclic queries 
("Find people whose parents are married to each other.") even though the 
relevant data is indexed. The point is that you need adequate index structures 
for each query feature. You may have "indexed everything" and still not be able 
to do the queries you want.

Moreover, I am creative enough to come up with queries that would be very hard 
to implement, yet this does not mean that we should do it. Both "everything" 
and "everything we are creative enough for" are poor design principles.

So how to move forward? The usual approach to design a practical system is to 
collect use cases and requirements (example queries) and then make a clear 
decision what should be supported and what shouldn't. One can always revise 
this decision later, but it's still much better to make it explicit than to 
just go along and see what we get. In all of this, it must be understood that 
supporting "everything" is an impossible task. The question is merely where to 
draw the line(s).


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread Multichill
Multichill added a comment.

I would like to turn it around. We should support indexing everything:

- Labels
- Aliases
- Descriptions
- Sitelinks
  - Including badges
- Statements
  - Including ranks etc
  - Including qualifiers
  - Including references

(probably forgot something)

We should have a good reason to not index something. The fact that we're not 
creative enough to make up queries for everything doesn't mean it isn't useful.


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, Multichill
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-09 Thread Lydia_Pintscher
Lydia_Pintscher added a comment.

Didn't you have qualifiers in the proposal already? Anyway. They should be in 
;-)

Some examples:
References: "Give me all statements referenced to Nature" or "Give me all 
cities in Germany over 2 Million inhabitants based on the national statistics 
office" 
Aliases: The same as labels
Sitelinks: "Give me all people that have an article on Malayalam Wikipedia"
Badges: "Give me all people that have an excellent article on English Wikipedia"


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, Lydia_Pintscher
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-09 Thread Smalyshev
Smalyshev added a comment.

Only things explicitly mentioned in the wiki are covered now. So, qualifiers 
are covered, but references, sitelinks, badges and aliases are not.

I'd like to hear more what lookups would we do against sitelinks, aliases and 
references - do we have any existing examples?


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-09 Thread Lydia_Pintscher
Lydia_Pintscher added a comment.

Yeah they are.


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, Lydia_Pintscher
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-08 Thread Smalyshev
Smalyshev added a comment.

Here we would like some feedback from wikidata team.


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: Aklapper, Smalyshev, jkroll, Wikidata-bugs, aude, GWicke, Manybubbles, 
daniel



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs