Re: [Wikidata-l] Data values

Marco Fleckinger Wed, 19 Dec 2012 06:32:51 -0800


On 2012-12-19 15:11, Daniel Kinzler wrote:

On 19.12.2012 14:34, Friedrich Röhrs wrote:

Hi,

Sorry for my ignorance, if this is common knowledge: What is the use case for
sorting millions of different measures from different objects?


Finding all cities with more than 100000 inhabitants requires the database to
look through all values for the property "population" (or even all properties
with countable values, depending on implementation and query planning), compare
each value with "100000" and return those with a greater value. To speed this
up, an index sorted by this value would be needed.
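
A minimal sketch of the indexed range query described above, using SQLite from
Python. The table name `statement`, its columns, the index name and the sample
rows are all made up for illustration; nothing here reflects the actual
Wikidata schema.

```python
import sqlite3

# Hypothetical layout: one row per (item, property, value) statement.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statement (item TEXT, property TEXT, value REAL)")
# The index is sorted by property first, then by value, so a range
# condition on value within one property can be answered from the index.
conn.execute("CREATE INDEX idx_property_value ON statement (property, value)")
conn.executemany(
    "INSERT INTO statement VALUES (?, ?, ?)",
    [
        ("city_a", "population", 1700000),   # illustrative numbers only
        ("city_b", "population", 250000),
        ("village_c", "population", 800),
        ("city_a", "elevation", 190),
    ],
)

QUERY = (
    "SELECT item FROM statement "
    "WHERE property = ? AND value > ? ORDER BY value"
)

def items_above(prop, threshold):
    """Return items whose value for `prop` exceeds `threshold`, ascending."""
    return [row[0] for row in conn.execute(QUERY, (prop, threshold))]

def query_plan(prop, threshold):
    """Return SQLite's plan description for the same query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (prop, threshold))
    return " ".join(row[3] for row in rows)

big_cities = items_above("population", 100000)
plan = query_plan("population", 100000)
```

Printing `plan` should show whether SQLite searches via `idx_property_value`
rather than scanning the whole table.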

On top of that come multiple simultaneous sorting operations.

For cars there could be entries by the manufacturer, by some
car-testing magazine, etc. I don't see how this could be adequately
represented/sorted by a database-only query.


If this cannot be done adequately on the database level, then it cannot be done
efficiently, which means we will not allow it. So our task is to come up with an
architecture that does allow this.

(One way to allow "scripted" queries like this to run efficiently is to do this
in a massively parallel way, using a map/reduce framework. But that's also not
trivial, and would require a whole new server infrastructure).

Software developers are not allowed to think only of the status quo; they also
have to think of the use cases the solution may be used for.

There is, e.g., the idea of pushing the monument lists into Wikidata. In Austria
alone there are 36,000-37,000 of those. Germany is much bigger but has a similar
history, with probably a similar number per square kilometre. Sorting these by
distance to a specific place needs to be done by the database. Everything else
will be too inefficient.

If however this is necessary, I still don't understand why it must affect the
datavalue structure. If an index is necessary it could be done over a serialized
representation of the value.


"Serialized" can mean a lot of things, but an index on some data blob is only
useful for exact matches; it cannot be used for greater/lesser queries. We need
to map our values to scalar data types the database can understand directly, and
use for indexing.
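
A small sketch of why ordering by a serialized form does not help with
greater/lesser queries. JSON is only an assumed serialization here, and the
`amount` field name is hypothetical; the point is that serialized strings
compare character by character, not by numeric value.

```python
import json

values = [9, 100000, 25, 1500]

# Serialize each value into a blob, as a generic value store might.
blobs = [json.dumps({"amount": v}) for v in values]

# Order an index over the blobs would have: plain string comparison.
blob_order = [json.loads(b)["amount"] for b in sorted(blobs)]

# Order an index over a scalar numeric column would have.
numeric_order = sorted(values)
```

Here `blob_order` puts 100000 before 9, so a range scan over the blob index
cannot answer "greater than 100000".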

+1

This needs to be done anyway, since the values are
saved in a specific unit (which is just a Wikidata item). To compare them on a
database level they must all be saved in the same unit, or some sort of
procedure must be used to compare them (or am I missing something again?).


If they measure the same dimension, they should be saved using the same unit
(probably the SI base unit for that dimension). Saving values using different
units would make it impossible to run efficient queries against these values,
thereby defeating one of the major reasons for Wikidata's existence. I don't see a
way around this.
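
As a rough illustration of saving everything in one unit, the sketch below
converts lengths to metres before comparing them. The unit table, the function
name `to_metres` and the sample peaks are assumptions made for this example;
only the conversion factors themselves are standard.

```python
# Metres per unit; foot and yard use their exact international definitions.
TO_METRES = {"m": 1.0, "km": 1000.0, "ft": 0.3048, "yd": 0.9144}

def to_metres(amount, unit):
    """Convert a length to the SI base unit so values become comparable."""
    return amount * TO_METRES[unit]

# Hypothetical input, entered in three different units.
raw = [("peak_a", 2.5, "km"), ("peak_b", 9000, "ft"), ("peak_c", 2600, "m")]

# Once normalized, a plain numeric sort (or a database index) is enough.
normalized = sorted((to_metres(amount, unit), name) for name, amount, unit in raw)
ranking = [name for _, name in normalized]
```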

IMHO this should be part of the model. E.g. altitudes are usually measured in
metres or feet, never in km or yards. Distances have the same SI base unit but
are also measured in km, depending on the use case.

Maybe we should distinguish between internal usage and visualization. Comparing
metres with kilometres and feet is quite difficult; rescaling everything on
visualization is not.
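
One possible way to separate the two, sketched under assumptions: the value is
kept internally in metres, and the unit it was entered in is remembered only
for display. The class `Length` and its fields are hypothetical and not part
of any existing Wikidata data model.

```python
from dataclasses import dataclass

METRES_PER_UNIT = {"m": 1.0, "km": 1000.0, "ft": 0.3048}

@dataclass(frozen=True)
class Length:
    metres: float       # internal value, used for comparison and indexing
    display_unit: str   # unit used only when showing the value

    @classmethod
    def from_input(cls, amount, unit):
        """Store the amount in metres, remember the unit it was entered in."""
        return cls(amount * METRES_PER_UNIT[unit], unit)

    def render(self, digits=1):
        """Rescale to the display unit at visualization time."""
        shown = self.metres / METRES_PER_UNIT[self.display_unit]
        return f"{shown:.{digits}f} {self.display_unit}"

altitude = Length.from_input(12000, "ft")
distance = Length.from_input(3.5, "km")

# Comparison only ever touches the internal value.
distance_is_shorter = distance.metres < altitude.metres
```

With this split, `altitude.render()` still reads in feet and
`distance.render()` in kilometres, while comparisons only use metres.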


Cheers

Marco

_______________________________________________
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
