[Wikidata] Re: WoolNet: New application to find connections between Wikidata entities

2023-07-31 Thread Aidan Hogan

Hi Lydia, all,

Many thanks for the feedback on the system!

For those wanting just to play with the system you can try it here:

https://woolnet.dcc.uchile.cl/

The survey itself is here:

https://forms.gle/sCNqrAtJo98388iC6

(Just in case someone wishes to answer the survey, it might be best to 
do so before reading further.)


On 2023-07-27 04:59, Lydia Pintscher wrote:

On Wed, Jul 26, 2023 at 11:17 PM Platonides  wrote:

That's great!

... Where can we find this application?


The tool is linked on the second page of the survey Aidan linked.

@Aidan and Cristóbal: Great tool. Thanks for putting it together! One
thing I noticed that can maybe be improved is that it seems to focus
on relations between two nodes first before moving on to other nodes.
I tried China, Brazil and Germany and the timer ended when it had only
added connections between two of them (diplomatic relations, there are
_a lot_ of them...).


Thanks, yes, that's a useful test case! Given the way the 
algorithm proceeds, it might not be trivial to address this, but we 
understand the issue. Regarding there being a lot of diplomatic 
relations, that's a second issue that we've discussed; we think it 
would be useful to give users a way to filter the connections that are 
formed, for example by relation type. This might come in a later 
iteration if there is sufficient interest in further developing the 
tool. It would not be difficult in a technical sense, but it's a 
question of how to do it in a usable way. :)


I created an issue here for the former improvement:

https://github.com/TinSlim/WD-PathFinder/issues/7

Best!
Aidan
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/G3NY6XR4T74D2QLXL64NGCONHEZBB6DP/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] WoolNet: New application to find connections between Wikidata entities

2023-07-26 Thread Aidan Hogan

Hi everyone,

Cristóbal Torres, in CC, has been working over the past few months on a 
new web application for Wikidata called WoolNet. The application helps 
users to find and visualise connections between Wikidata entities. We 
are pleased to share this application with you, and would love to hear 
your thoughts. We've prepared a quick survey, which links to the system:


https://forms.gle/sCNqrAtJo98388iC6

We are also happy for you to forward this email or the system to 
whoever you think might be interested.


The application was developed by Cristóbal in the context of his 
undergraduate thesis here at the University of Chile.


Thoughts, comments, questions, etc., very welcome!

Best,
Aidan
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/3GLYYUHB5J73DJT7E5XD64TOLTECIILN/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Re: Wikidata Atlas: a geographic view of Wikidata entities [feedback welcome!]

2022-12-16 Thread Aidan Hogan

Hi Diego,

Thanks for the pointer; this is very cool! We would be happy to share 
experiences. (It's very impressive how many points you are able to 
render, and how these resize at different scales!)


Indeed it seems we were not so original with the name. :)

It seems the two systems offer different functionalities: one focuses 
on "what's close to here", while the other focuses on "where in the 
world are there X", like "where in the world are there lighthouses 
[1]", but generalised to all the types in Wikidata. It would be 
interesting to see how these two modalities could be combined in future.


Best,
Aidan

[1] https://www.lightphotos.net/photos/map_all.php

On 2022-12-16 21:45, Diego Saez-Trumper wrote:

Hi Aidan,

With Tassos and Rossano, we have a similar project (same name in fact). 
You can check it out here: www.wiki-atlas.org; maybe we could exchange 
some experiences about it.


Best,
Diego

___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/4ZE6CWFNHOYQH47DEXJKT7J4P2ASSQVN/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org

___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/H2JPDXIET5I2ZXZOUXQNJDCAF4FXJAOP/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] Wikidata Atlas: a geographic view of Wikidata entities [feedback welcome!]

2022-12-16 Thread Aidan Hogan

Hi all,

Benjamín, in CC, is an undergraduate student who has been working on a 
system and interface called "Wikidata Atlas". The system allows the user 
to search for different types of entities (with geo-coordinates) on 
Wikidata and visualise them on a world map.


The system is available here:

https://wdatlas.dcc.uchile.cl/

(A query I found interesting was "nuclear weapons test (Q210112)", for 
example.)


Feedback is very welcome!! To help us to evaluate and improve the tool, 
we would be very grateful if you could fill in the following quick survey:


https://forms.gle/AN2LTuiQ1pzamfVHA

Best,
Aidan & Benjamín
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/I6NIUOZ2KDEUZDBJ5N7H5DYY2DHYIRPP/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


[Wikidata] New Question Answering tool for Wikidata [feedback needed]

2022-12-13 Thread Aidan Hogan

Hi all,

Francisca, in CC, is an undergraduate student who has been working over the
past few months on a new template-based Question Answering (QA) tool for
Wikidata called Templet, which is available here:

https://templet.dcc.uchile.cl/

We would be *very* grateful if you could help us to evaluate and improve
the tool by answering the following quick questionnaire that Francisca
has prepared (should take only a few minutes):

https://docs.google.com/forms/d/e/1FAIpQLSeuqtS8jbTOXFNsnVwTEYf_vk2zXPHj8FococAzdPQCk1hIBw/viewform


Templet is based on questions from QAWiki (http://qawiki.org/), which
anyone can add questions/queries to. But QAWiki is perhaps a subject for
a follow-up email later. For now we would be very grateful for your 
feedback on Templet itself via the survey linked above. :)


Best regards,
Aidan & Francisca
___
Wikidata mailing list -- wikidata@lists.wikimedia.org
Public archives at 
https://lists.wikimedia.org/hyperkitty/list/wikidata@lists.wikimedia.org/message/UMXRUMDDBB7R5OJ75DAXX5M4GK3L4EWF/
To unsubscribe send an email to wikidata-le...@lists.wikimedia.org


Re: [Wikidata] 2 million queries against a Wikidata instance

2020-07-23 Thread Aidan Hogan

Hi Adam,

On 2020-07-13 13:41, Adam Sanchez wrote:

Hi,

I have to launch 2 million queries against a Wikidata instance.
I have loaded Wikidata in Virtuoso 7 (512 RAM, 32 cores, SSD disks with RAID 0).
The queries are simple, just 2 types.

select ?s ?p ?o {
  ?s ?p ?o .
  filter (?s = ?param)
}

select ?s ?p ?o {
  ?s ?p ?o .
  filter (?o = ?param)
}

If I use a Java ThreadPoolExecutor, it takes 6 hours.
How can I speed up the query processing even more?


Perhaps I am a bit late to respond.

It's not really clear to me what you are aiming for, but if this is a 
once-off task, I would recommend downloading the dump in Turtle or 
N-Triples, loading your two million parameters into memory in a sorted or 
hashed data structure in the programming language of your choice (this 
should take considerably less than 1 GB of memory assuming typical 
constants), using a streaming RDF parser for that language, and, for each 
subject/object, checking if it's in your in-memory list. This solution is 
about as good as you can get in terms of once-off batch processing.
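
To make that concrete, here is a minimal sketch of the batch approach in 
Java (purely illustrative): it assumes an uncompressed N-Triples dump and 
a plain-text file of parameter IRIs written exactly as they appear in the 
dump (e.g., <http://www.wikidata.org/entity/Q42>); the file names are 
placeholders, and the line-splitting is a shortcut for a proper streaming 
RDF parser:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Sketch of the once-off batch approach: load the ~2 million entity IRIs
// into a hash set, stream the N-Triples dump line by line, and print every
// triple whose subject or object is one of them. File names are placeholders.
public class BatchLookup {
    public static void main(String[] args) throws IOException {
        Set<String> params = new HashSet<>(
                Files.readAllLines(Paths.get("params.txt"), StandardCharsets.UTF_8));
        try (BufferedReader dump = Files.newBufferedReader(
                Paths.get("wikidata-truthy.nt"), StandardCharsets.UTF_8)) {
            String line;
            while ((line = dump.readLine()) != null) {
                if (line.isEmpty() || line.startsWith("#")) continue;  // skip blanks/comments
                int s1 = line.indexOf(' ');
                int s2 = line.indexOf(' ', s1 + 1);
                int end = line.lastIndexOf(" .");
                if (s1 < 0 || s2 < 0 || end < 0) continue;             // malformed line
                String subject = line.substring(0, s1);
                String object = line.substring(s2 + 1, end).trim();
                if (params.contains(subject) || params.contains(object)) {
                    System.out.println(line);                          // matching triple
                }
            }
        }
    }
}

A gzipped dump could be wrapped in a GZIPInputStream, and Turtle would 
need a real streaming parser, but the shape of the solution stays the 
same: one linear pass over the dump with constant-time lookups against 
the in-memory set.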


If your idea is to index the data so you can do 2 million lookups in 
"interactive time", your problem is not what software to use, it's what 
hardware to use.


Traditional hard disks have a physical arm that takes maybe 5-10 ms to 
move. Solid-state disks are quite a bit better, but still have seeks in 
the range of 0.1 ms. Multiply those seek times by 2 million and you have 
a long wait: 2 million seeks at 5 ms is roughly 2.8 hours, and even at 
0.1 ms it is still over 3 minutes (caching will help, as will multiple 
disks, but not by nearly enough). You would need to get the data into 
main memory (RAM) to have any chance of approximating interactive times, 
and even then you will probably not get interactive runtimes without 
leveraging some further assumptions about what you want to do (e.g., if 
you're only interested in Q IDs, you can use integers or bit vectors, etc.).

In the most general case, you would probably need to pre-filter the data 
as much as you can, and also use as much compression as you can (ideally 
with compact data structures) to get the data into memory on one machine, 
or you might think about something like Redis (an in-memory key-value 
store) on lots of machines. Essentially, if your goal is interactive times 
on millions of lookups, you very likely need to look at options purely in 
RAM (unless you have thousands of disks available, at least). The good 
news is that 512 GB(?) sounds like a lot of space to store stuff in.


Best,
Aidan


I was thinking :

a) to implement a Virtuoso cluster to distribute the queries or
b) to load Wikidata in a Spark dataframe (since Sansa framework is
very slow, I would use my own implementation) or
c) to load Wikidata in a Postgresql table and use Presto to distribute
the queries or
d) to load Wikidata in a PG-Strom table to use GPU parallelism.

What do you think? I am looking for ideas.
Any suggestion will be appreciated.

Best,

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Concise/Notable Wikidata Dump

2019-12-19 Thread Aidan Hogan

Hey all,

Just a general response to all the comments thus far.

- @Marco et al., regarding the WDumper by Benno, this is a very cool 
initiative! In fact I spotted it just *after* posting so I think this 
goes quite some ways towards addressing the general issue raised.


- @Markus, I partially disagree regarding the importance of 
rubber-stamping a "notable dump" on the Wikidata side. I would see its 
value as being something like the "truthy dump", which I believe has 
been widely used in research for working with a concise sub-set of 
Wikidata. Perhaps a middle ground is for a sporadic "notable dump" to be 
generated by WDumper and published on Zenodo. This may be sufficient in 
terms of making the dump available and reusable for research purposes 
(or even better than the current dumps, given the permanence you 
mention). It would also reduce costs on the Wikidata side (I don't think 
a notable dump would need to be generated on a weekly basis, for 
example).


- @Lydia, good point! I was thinking that filtering by wikilinks will 
just drop some more obscure nodes (like Q51366847 for example), but had 
not considered that there are some more general "concepts" that do not 
have a corresponding Wikipedia article. All the same, in a lot of the 
research we use Wikidata for, we are not particularly interested in one 
thing or another, but more interested in facilitating what other people 
are interested in. Examples would be query performance, finding paths, 
versioning, finding references, etc. But point taken! Maybe there is a 
way to identify "general entities" that do not have wikilinks, but do 
have a high degree or centrality, for example? Would a degree-based or 
centrality-based filter be possible in something like WDumper (perhaps 
it goes beyond the original purpose; certainly it does not seem trivial 
in terms of resources used)? Would it be a good idea?


In summary, I like the idea of using WDumper to sporadically generate -- 
and publish on Zenodo -- a "notable version" of Wikidata filtered by 
sitelinks (perhaps also allowing other high-degree or high-PageRank 
nodes to pass the filter). At least I know I would use such a dump.


Best,
Aidan

On 2019-12-19 6:46, Lydia Pintscher wrote:

On Tue, Dec 17, 2019 at 7:16 PM Aidan Hogan  wrote:


Hey all,

As someone who likes to use Wikidata in their research, and likes to
give students projects relating to Wikidata, I am finding it more and
more difficult to (recommend to) work with recent versions of Wikidata
due to the increasing dump sizes, where even the truthy version now
costs considerable time and machine resources to process and handle. In
some cases we just grin and bear the costs, while in other cases we
apply an ad hoc sampling to be able to play around with the data and try
things quickly.

More generally, I think the growing data volumes might inadvertently
scare people off taking the dumps and using them in their research.

One idea we had recently to reduce the data size for a student project
while keeping the most notable parts of Wikidata was to only keep claims
that involve an item linked to Wikipedia; in other words, if the
statement involves a Q item (in the "subject" or "object") not linked to
Wikipedia, the statement is removed.

I wonder would it be possible for Wikidata to provide such a dump to
download (e.g., in RDF) for people who prefer to work with a more
concise sub-graph that still maintains the most "notable" parts? While
of course one could compute this from the full-dump locally, making such
a version available as a dump directly would save clients some
resources, potentially encourage more research using/on Wikidata, and
having such a version "rubber-stamped" by Wikidata would also help to
justify the use of such a dataset for research purposes.

... just an idea I thought I would float out there. Perhaps there is
another (better) way to define a concise dump.

Best,
Aidan


Hi Aidan,

That the dumps are becoming too big is an issue I've heard a number of
times now. It's something we need to tackle. My biggest issue is
deciding how to slice and dice it though in a way that works for many
use cases. We have https://phabricator.wikimedia.org/T46581 to
brainstorm about that and figure it out. Input from several people
very welcome. I also added a link to Benno's tool there.
As for the specific suggestion: I fear relying on the existence of
sitelinks will kick out a lot of important things you would care about,
like professions, so I'm not sure that's a good thing to offer
officially for a larger audience.


Cheers
Lydia



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Concise/Notable Wikidata Dump

2019-12-17 Thread Aidan Hogan

Hey all,

As someone who likes to use Wikidata in their research, and likes to 
give students projects relating to Wikidata, I am finding it more and 
more difficult to (recommend to) work with recent versions of Wikidata 
due to the increasing dump sizes, where even the truthy version now 
costs considerable time and machine resources to process and handle. In 
some cases we just grin and bear the costs, while in other cases we 
apply an ad hoc sampling to be able to play around with the data and try 
things quickly.


More generally, I think the growing data volumes might inadvertently 
scare people off taking the dumps and using them in their research.


One idea we had recently to reduce the data size for a student project 
while keeping the most notable parts of Wikidata was to only keep claims 
that involve an item linked to Wikipedia; in other words, if the 
statement involves a Q item (in the "subject" or "object") not linked to 
Wikipedia, the statement is removed.
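
As a rough illustration of how cheap such a filter is to prototype 
locally, here is a minimal sketch in Java; it assumes an N-Triples 
serialisation of the full dump, where (if memory serves) sitelinks appear 
as "<article> <http://schema.org/about> <entity> ." triples, and it 
restricts to article subjects containing "wikipedia.org". The file name is 
a placeholder and the token-splitting stands in for a proper RDF parser:

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.HashSet;
import java.util.Set;

// Sketch of the sitelink filter: pass 1 collects entities that are the
// object of a schema:about triple whose subject is a Wikipedia article;
// pass 2 keeps only triples that do not mention an entity outside that set.
public class SitelinkFilter {
    private static final String SCHEMA_ABOUT = "<http://schema.org/about>";
    private static final String ENTITY_PREFIX = "<http://www.wikidata.org/entity/Q";

    public static void main(String[] args) throws IOException {
        Path dump = Paths.get("wikidata-all.nt");                // placeholder file name
        Set<String> linked = new HashSet<>();

        try (BufferedReader in = Files.newBufferedReader(dump, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] t = line.split(" ");
                if (t.length >= 3 && SCHEMA_ABOUT.equals(t[1])
                        && t[0].contains("wikipedia.org") && t[2].startsWith(ENTITY_PREFIX)) {
                    linked.add(t[2]);                            // entity has a Wikipedia sitelink
                }
            }
        }

        try (BufferedReader in = Files.newBufferedReader(dump, StandardCharsets.UTF_8)) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] t = line.split(" ");
                if (t.length < 3) continue;
                boolean subjOk = !t[0].startsWith(ENTITY_PREFIX) || linked.contains(t[0]);
                boolean objOk = !t[2].startsWith(ENTITY_PREFIX) || linked.contains(t[2]);
                if (subjOk && objOk) System.out.println(line);   // statement survives the filter
            }
        }
    }
}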


I wonder would it be possible for Wikidata to provide such a dump to 
download (e.g., in RDF) for people who prefer to work with a more 
concise sub-graph that still maintains the most "notable" parts? While 
of course one could compute this from the full-dump locally, making such 
a version available as a dump directly would save clients some 
resources, potentially encourage more research using/on Wikidata, and 
having such a version "rubber-stamped" by Wikidata would also help to 
justify the use of such a dataset for research purposes.


... just an idea I thought I would float out there. Perhaps there is 
another (better) way to define a concise dump.


Best,
Aidan

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Nobel Prizes and consensus in Wikidata

2019-09-27 Thread Aidan Hogan

Hey all,

Andra recently mentioned finding laureates in Wikidata, and it 
reminded me that some weeks ago I was trying to come up with a SPARQL 
query to find all Nobel Prize winners in Wikidata.


What I ended up with was:

SELECT ?winner
WHERE {
  ?winner wdt:P166 ?prize .
  ?prize (wdt:P361|wdt:P31|wdt:P279) wd:Q7191 .
}


More specifically, looking into the data I found:

Nobel Peace Prize (Q35637)
  part of (P361)
    Nobel Prize (Q7191) .

Nobel Prize in Literature (Q37922)
  subclass of (P279)
    Nobel Prize (Q7191) .

Nobel Prize in Economics (Q47170)
  instance of (P31)
    Nobel Prize (Q7191) ;
  part of (P361)
    Nobel Prize (Q7191) .

Nobel Prize in Chemistry (Q44585)
  instance of (P31)
    Nobel Prize (Q7191) ;
  part of (P361)
    Nobel Prize (Q7191) .

Nobel Prize in Physics (Q38104)
  subclass of (P279)
    Nobel Prize (Q7191) ;
  part of (P361)
    Nobel Prize (Q7191) .

In summary, of the six types of Nobel prizes, three different properties 
are used in five different combinations to state that they "are", in 
fact, Nobel prizes. :)


Now while it would be interesting to discuss the relative merits of P31 
vs. P279 vs. P361 vs. some combination thereof in this case and similar 
such cases, I guess I am more interested in the general problem of the 
lack of consensus that such a case exhibits.


What processes (be they social, technical, or some combination thereof) 
are currently in place to reach consensus in these cases in Wikidata?


What could be put in place in future to highlight and reach consensus?

Or is the idea more to leave the burden of "integrating" different 
viewpoints to the consumer (e.g., to the person writing the query)?


(Of course these are all "million dollar questions" that have been with 
the Semantic Web since the beginning, but I am curious about what is 
being done or can be done in the specific context of Wikidata to foster 
consensus and reduce heterogeneity in such cases.)


Best,
Aidan

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata [thanks!]

2018-03-30 Thread Aidan Hogan
Apologies for that; just checking again now, it seems it had 404'ed 
again. We're not sure what's causing the issue at the moment, but we've 
added an hourly cron job to restart it if it is not running. This should 
hopefully improve the availability until we figure out the cause.


Best,
Aidan

On 20-03-2018 10:51, José Ignacio Moreno wrote:

Fixed.
Thanks for letting us know.


On Tue, 20 Mar 2018 at 08:23, Riccardo Tasso wrote:


404 Error on http://grafa.dcc.uchile.cl/

Could you please fix it?

2018-03-08 21:02 GMT+01:00 Aidan Hogan:

Hi Joachim!

Understood, yes! The ability to select entities with any value
for some property is something I agree would be a useful next
step, and the use-case you mention of something like a
Wikidata/GraFa exploration of external resources is very
interesting! Thanks!

There are some technical reasons why it would not be so trivial
right now, mostly in the back-end relating to caching. GraFa
caches all facets for all possible queries that generate more
than 50,000 results. This is how we can generate the exact
facets, for example, for the 352,063 people from the U.S. in
under a second (the data are precomputed; if I recall correctly,
otherwise the query would take around 30 seconds). The ability
to query for existential values (a property with any value) will
increase a lot the number of queries we would have to cache. I
think it would still be feasible in the current system, but we
would have to do some work. In any case, I have added it to the
issue tracker!

(On a more organisational side, I'm just noticing now that
there's quite a few open issues on the tracker to look at, but
José Moreno, the masters student who did all the hard work on
GraFa, is currently finalising his thesis and preparing to
defend. Hence development on Grafa is paused for the moment but
I hope we can find a way to continue since the feedback has been
encouraging!)

Cheers,
Aidan


On 07-03-2018 9:04, Neubert, Joachim wrote:

Hi Aidan,

Thanks for your reply! My suggestion indeed was to feed in a
property (e.g. " P2611") - not a certain value or range of
values for that property - and to restrict the initial set
to all items where that property is set (e.g., all known TED
speakers in Wikidata).

That would allow to apply further faceting (e.g., according
to occupation or country) to that particular subset of
items. In effect it would offer an alternate view to the
original database (e.g., https://www.ted.com/speakers which
is organized by topic and by event). Thus, the full and
often very rich structured data of Wikidata could be used to
explore external datasets which are linked to WD via
external identifiers.

Being able to browse their own special collections by facets
from Wikidata could perhaps even offer an incentive to GLAM
institutions to contribute to Wikidata. It may turn out much
easier to add some missing data to Wikidata, in relation to
introducing a new field in their own database/search
interface, and populating it from scratch.

So I'd suggest that additional work invested into GraFa here
could pay out in a new pattern of use for both Wikidata and
collections linked by external identifiers.

Cheers, Joachim

-Original Message-
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf
Of Aidan Hogan
Sent: Wednesday, 7 March 2018 06:16
To: wikidata@lists.wikimedia.org
Subject: Re: [Wikidata] GraFa: Faceted browser for
RDF/Wikidata [thanks!]

Hi Joachim,

On 14-02-2018 7:32, Neubert, Joachim wrote:

Hi Aidan, hi José,

I'm a bit late - sorry!


Likewise! :)

What came to my mind as a perhaps easy extension: Can
or could the browser be seeded with an external property
(for example P2611, TED speaker ID)?

That would allow to browse some external dataset (e.g.,
all known TED speakers) by the facets provided by Wikidata.


Thanks for the suggestion! While it might seem an easy
extension, unfortunately that would actually require s

[Wikidata] Historical (RDF) dumps

2018-03-30 Thread Aidan Hogan

Hi all,

With a couple of students we are working on various topics relating to 
the dynamics of RDF and Wikidata. The public dumps in RDF cover the past 
couple of months:


https://dumps.wikimedia.org/wikidatawiki/entities/

I'm wondering is there a way to get access to older dumps or perhaps 
generate them from available data? We've been collecting dumps but it 
seems we have a gap for a dump on 2017/07/04 right in the middle of our 
collection. :) (If anyone has a copy of the truthy data for that 
particular month, I would be very grateful if they can reach out.)


In general, I think it would be fantastic to have a way to access all 
historical dumps. In particular, datasets might be used in papers, and 
for reproducibility purposes it would lift a burden from the authors 
to be able to link to (rather than having to host) the data used. I am 
not sure whether such an archive is feasible or not, though.


Thanks,
Aidan

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata [thanks!]

2018-03-08 Thread Aidan Hogan

Hi Joachim!

Understood, yes! The ability to select entities with any value for some 
property is something I agree would be a useful next step, and the 
use-case you mention of something like a Wikidata/GraFa exploration of 
external resources is very interesting! Thanks!


There are some technical reasons why it would not be so trivial right 
now, mostly in the back-end relating to caching. GraFa caches all facets 
for all possible queries that generate more than 50,000 results. This is 
how we can generate the exact facets, for example, for the 352,063 
people from the U.S. in under a second (the data are precomputed; 
otherwise, if I recall correctly, the query would take around 30 
seconds). The ability to query for existential values (a property with 
any value) would greatly increase the number of queries we would have to 
cache. I think it would still be feasible in the current system, but we 
would have to do some work. In any case, I have added it to the issue 
tracker!


(On a more organisational note, I'm just noticing now that there are 
quite a few open issues on the tracker to look at, but José Moreno, the 
masters student who did all the hard work on GraFa, is currently 
finalising his thesis and preparing to defend. Hence development on 
GraFa is paused for the moment, but I hope we can find a way to 
continue, since the feedback has been encouraging!)


Cheers,
Aidan

On 07-03-2018 9:04, Neubert, Joachim wrote:

Hi Aidan,

Thanks for your reply! My suggestion indeed was to feed in a property (e.g. " 
P2611") - not a certain value or range of values for that property - and to restrict 
the initial set to all items where that property is set (e.g., all known TED speakers in 
Wikidata).

That would allow to apply further faceting (e.g., according to occupation or 
country) to that particular subset of items. In effect it would offer an 
alternate view to the original database (e.g., https://www.ted.com/speakers 
which is organized by topic and by event). Thus, the full and often very rich 
structured data of Wikidata could be used to explore external datasets which 
are linked to WD via external identifiers.

Being able to browse their own special collections by facets from Wikidata 
could perhaps even offer an incentive to GLAM institutions to contribute to 
Wikidata. It may turn out much easier to add some missing data to Wikidata, in 
relation to introducing a new field in their own database/search interface, and 
populating it from scratch.

So I'd suggest that additional work invested into GraFa here could pay out in a 
new pattern of use for both Wikidata and collections linked by external 
identifiers.

Cheers, Joachim

-Original Message-
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of 
Aidan Hogan
Sent: Wednesday, 7 March 2018 06:16
To: wikidata@lists.wikimedia.org
Subject: Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata [thanks!]

Hi Joachim,

On 14-02-2018 7:32, Neubert, Joachim wrote:

Hi Aidan, hi José,

I'm a bit late - sorry!


Likewise! :)


What came to my mind as a perhaps easy extension: Can or could the browser be 
seeded with an external property (for example P2611, TED speaker ID)?

That would allow to browse some external dataset (e.g., all known TED speakers) 
by the facets provided by Wikidata.


Thanks for the suggestion! While it might seem an easy extension, unfortunately 
that would actually require some significant changes since GraFa only considers 
values that have a label/alias we can auto-complete on (which in the case of 
Wikidata means, for the most part, Q* values).

While it would be great to support datatype/external properties, we figured 
that adding them to the system in a general and clean way would not be trivial! 
We assessed that some such properties require ranges (e.g., date-of-birth or 
height), some require autocomplete (e.g., first name), etc. ... and in the case 
of IDs, it's not clear that these are really useful for faceted browsing 
perhaps since they will jump to a specific value. Hence it gets messy to handle 
in the interface and even messier in the back-end.

(A separate issue is that of existential values ... finding entities that have 
some value for a property as your example requires. That would require some 
work, but would be more feasible!)

Best,
Aidan


-Original Message-
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On
Behalf Of Aidan Hogan
Sent: Thursday, 8 February 2018 21:33
To: Discussion list for the Wikidata project.
Cc: José Ignacio .
Subject: Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata
[thanks!]

Hi all,

On behalf of José and myself, we would really like to thank the
people who tried out our system and gave us feedback!


Some aspects are left to work on (for example, we have not tested for
mobiles, etc.). However, we have made some minor initial changes
reflecting some of the comments w

Re: [Wikidata] Generating info-boxes from Wikidata: the importance of values!

2018-03-08 Thread Aidan Hogan

Hey Raphaël,

Thanks for the comments and the reference! And sorry that we missed 
discussing your paper (which indeed looks at largely the same problem 
in a slightly different context). If there's a next time, we will be 
sure to include it in the related work.


I am impressed btw to see a third-party evaluation of a Google tool. 
Also it seems Google has room for improvement. :)


Cheers,
Aidan

On 07-03-2018 13:43, Raphaël Troncy wrote:

Hey Aidan,

Great work, I loved it! You may want to (cite and) look at what we did 4 
years ago where we tried to reverse engineer a bit what Google is doing 
when choosing properties (and values) to show in its rich panels 
alongside popular entities.


The paper is entitled "What Are the Important Properties
of an Entity? Comparing Users and Knowledge Graph Point of View", 
https://www.eurecom.fr/~troncy/Publications/Assaf_Troncy-eswc14.pdf


... and the code is on github to replicate: 
https://github.com/ahmadassaf/KBE


   Raphaël

On 07/03/2018 at 05:53, Aidan Hogan wrote:

Hi all,

Tomás and I would like to share a paper that might be of interest to 
the community. It presents some preliminary results of a work looking 
at fully automated methods to generate Wikipedia info-boxes from 
Wikidata. The main focus is on deciding what information from Wikidata 
to include, and in what order. The results are based on asking users 
(students) to rate some prototypes of generated info-boxes.


Tomás Sáez, Aidan Hogan "Automatically Generating Wikipedia Infoboxes 
from Wikidata". In the Proceedings of the Wiki Workshop at WWW 2018, 
Lyon, France, April 24, 2018.


- Link: http://aidanhogan.com/docs/infobox-wikidata.pdf

We understand that populating info-boxes is an important goal of 
Wikidata and hence we thought we'd share some lessons learned.


Obviously a lot of work is being put into populating info-boxes from 
Wikidata, but the main methods at the moment seem to be template-based 
and require a lot of manual labour; plus the definition of these 
templates seems to be a difficult problem for classes such as person 
(where different information will have different priorities for people 
of different professions, notoriety, etc.).


We were just interested to see how far we could get with a fully 
automated approach using some generic ranking methods. Also we thought 
that something like this could perhaps be used to generate a "default" 
info-box for articles with no info-box and no associated template 
mapping. The paper presents preliminary results along those lines.


One interesting result is that a major factor in the evaluation of the 
generated info-boxes was the importance of the value. For example, 
Barack Obama has lots of awards, but perhaps only something like the 
Nobel Peace Prize might be of relevance to show in the info-box (<- 
being intended as an illustrative example rather than a concrete 
assertion of course!). Another example is that sibling might not be an 
important attribute in a lot of cases, but when that sibling is Barack 
Obama, then that deserves to be in the info-box (<- how such cases 
could be expressed in a purely template-based approach, we are not 
sure, but it would seem difficult).


We assess the importance of values with PageRank. Assessing the 
importance not only of attributes, but of values, turned out to be a 
major influence on how highly our evaluators assessed the quality of 
the generated info-boxes.


This initial/isolated observation might be interesting since, to the 
best of our understanding, the current wisdom on populating info-boxes 
from Wikidata focuses on what attributes to present and in which 
order, but does not consider the importance of values (aside from the 
Wikidata rank feature, which we believe is more intended to assess 
relevance/timeliness, than importance).


Hence one of the most interesting (and surprising, for us at least) 
results of the work is to suggest that it appears to be important to 
rank *values* by importance (not just attributes) when considering 
what information the user might be interested in.


(There are limitations to PageRank measures, however, in that they 
cannot assess, for example, the importance of a particular date, or, 
more generally, datatype values.)


In any case, we are looking forward to presenting these results at the 
Wiki Workshop at WWW 2018, and any feedback or thoughts are welcome!


Cheers,
Aidan

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Generating info-boxes from Wikidata: the importance of values!

2018-03-08 Thread Aidan Hogan

Hi Gerard,

Yes, this is very much along the lines of what we ultimately ended up 
realising! At first we started out just trying to propose an alternative 
way to generate info-boxes without templates or other info-box resources. 
Later we came to realise that the template approach could in fact 
probably benefit from some of the "smart features" along the lines of our 
work, for the sorts of reasons you outline.


If anyone will be at the Wiki-workshop (WWW) in Lyon in April, we would 
be happy to discuss!


Cheers,
Aidan

On 07-03-2018 3:22, Gerard Meijssen wrote:

Hoi,
Like they say in real estate.. "position, position". The value of your 
research is imho less in presenting static info boxes but more in being 
able to show info boxes about the context of the item involved. You may 
be interested in law professors, senators or presidents and in each case 
you may get presented different information about the same person; in 
your example professor Obama.


It is the same with awards. Consider the George Polk Award, a notable 
journalism award. You can view them from the perspective of the award 
winner, but also from the perspective of the publication the awardees 
work(ed) for. The Polk award has "categories"; they are not included in 
the Wikidata data yet, but they would show awardees in the same category 
in different years.


When you want info boxes and make them static, you have to sit in 
judgement and kill off the "excess", but that may just be what people are 
looking for. When you make them smart, you will be able to provide the 
information that people are likely to be looking for. So please consider 
the smart application of your research.


In these examples we have a lot of information for the items involved. 
There are over 500 Polk Award winners for instance but for many of these 
there is not even an article. With generated info boxes you may be able 
to provide information anyway. It has just one prerequisite; the red 
links are linked to Wikidata.

Thanks,
        GerardM

On 7 March 2018 at 05:53, Aidan Hogan wrote:


Hi all,

Tomás and I would like to share a paper that might be of interest to
the community. It presents some preliminary results of a work
looking at fully automated methods to generate Wikipedia info-boxes
from Wikidata. The main focus is on deciding what information from
Wikidata to include, and in what order. The results are based on
asking users (students) to rate some prototypes of generated info-boxes.

Tomás Sáez, Aidan Hogan "Automatically Generating Wikipedia
Infoboxes from Wikidata". In the Proceedings of the Wiki Workshop at
WWW 2018, Lyon, France, April 24, 2018.

- Link: http://aidanhogan.com/docs/infobox-wikidata.pdf

We understand that populating info-boxes is an important goal of
Wikidata and hence we thought we'd share some lessons learned.

Obviously a lot of work is being put into populating info-boxes from
Wikidata, but the main methods at the moment seem to be
template-based and require a lot of manual labour; plus the
definition of these templates seems to be a difficult problem for
classes such as person (where different information will have
different priorities for people of different professions, notoriety,
etc.).

We were just interested to see how far we could get with a fully
automated approach using some generic ranking methods. Also we
thought that something like this could perhaps be used to generate a
"default" info-box for articles with no info-box and no associated
template mapping. The paper presents preliminary results along those
lines.

One interesting result is that a major factor in the evaluation of
the generated info-boxes was the importance of the value. For
example, Barack Obama has lots of awards, but perhaps only something
like the Nobel Peace Prize might be of relevance to show in the
info-box (<- being intended as an illustrative example rather than a
concrete assertion of course!). Another example is that sibling
might not be an important attribute in a lot of cases, but when that
sibling is Barack Obama, then that deserves to be in the info-box
(<- how such cases could be expressed in a purely template-based
approach, we are not sure, but it would seem difficult).

We assess the importance of values with PageRank. Assessing the
importance not only of attributes, but of values, turned out to be a
major influence on how highly our evaluators assessed the quality of
the generated info-boxes.

This initial/isolated observation might be interesting since, to the
best of our understanding, the current wisdom on populating
info-boxes from Wikidata focuses on what attributes to present and
in which o

Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata [thanks!]

2018-03-06 Thread Aidan Hogan

Hi Joachim,

On 14-02-2018 7:32, Neubert, Joachim wrote:

Hi Aidan, hi José,

I'm a bit late - sorry!


Likewise! :)


What came to my mind as a perhaps easy extension: Can or could the browser be 
seeded with an external property (for example P2611, TED speaker ID)?

That would allow to browse some external dataset (e.g., all known TED speakers) 
by the facets provided by Wikidata.


Thanks for the suggestion! While it might seem an easy extension, 
unfortunately that would actually require some significant changes since 
GraFa only considers values that have a label/alias we can auto-complete 
on (which in the case of Wikidata means, for the most part, Q* values).


While it would be great to support datatype/external properties, we 
figured that adding them to the system in a general and clean way would 
not be trivial! We assessed that some such properties require ranges 
(e.g., date-of-birth or height), some require autocomplete (e.g., first 
name), etc. ... and in the case of IDs, it's not clear that these are 
really useful for faceted browsing perhaps since they will jump to a 
specific value. Hence it gets messy to handle in the interface and even 
messier in the back-end.


(A separate issue is that of existential values ... finding entities 
that have some value for a property as your example requires. That would 
require some work, but would be more feasible!)


Best,
Aidan


-Original Message-
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of
Aidan Hogan
Sent: Thursday, 8 February 2018 21:33
To: Discussion list for the Wikidata project.
Cc: José Ignacio .
Subject: Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata [thanks!]

Hi all,

On behalf of José and myself, we would really like to thank the people who
tried out our system and gave us feedback!


Some aspects are left to work on (for example, we have not tested for
mobiles, etc.). However, we have made some minor initial changes
reflecting some of the comments we received (adding example text for the
type box, clarifying that the numbers refer to number of results not Q
codes, etc.):

http://grafa.dcc.uchile.cl/


To summarise some aspects of the work and what we've learnt:

* In terms of usability, the principal lesson we have learnt (amongst
many) is that it is not clear for users what is a type. For example,
when searching for "popes born in Poland", the immediate response of
users is to type "pope" rather than "human" or "person" in the type box.
In a future version of the system, we might thus put less emphasis on
starting the search with type (the original reasoning behind this was to
quickly reduce the number of facets/properties that would be shown).
Hence the main conclusion here is to try to avoid interfaces that centre
around "types".


* A major design goal is that the user is only ever shown options that
lead to at least one result. All facets computed are exact with exact
numbers. The technical challenge here is displaying these facets with
exact numbers and values for large result sizes, such as human:

http://grafa.dcc.uchile.cl/search?instance=Q5

This is achieved through caching. We compute all possible queries in the
data that would yield >50,000 results (e.g., human->gender:male,
human->gender:male->country:United States, etc.). We then compute their
facets offline and cache them. In total there's only a couple of hundred
such queries generating that many results. The facets for other queries
with fewer than 50,000 results are computed live. Note that we cannot
cache for keyword queries (instead we just compute facets for the first
50,000 most relevant results). Also, if we add other features such as
range queries or sub-type reasoning, the issue of caching would become
far more complex to handle.


In any case, thanks again to all those who provided feedback! Of course
further comments or questions are welcome (either on- or off-list).
Likewise we will be writing up a paper describing technical aspects of
the system soon with some evaluation results. Once it's ready we will of
course share a link with you.

Best,
Aidan and José


 Forwarded Message 
Subject: Re: GraFa: Faceted browser for RDF/Wikidata [feedback requested]
Date: Mon, 15 Jan 2018 11:47:18 -0300
From: Aidan Hogan 
To: Discussion list for the Wikidata project. 
CC: José Ignacio . 

Hi all!

Just a friendly reminder that tomorrow we will close the questionnaire
so if you have a few minutes to help us out (or are just curious to see
our faceted search system) please see the links and instructions below.

And many thanks to those who have already provided feedback! :)

Best,
José & Aidan

On 09-01-2018 14:18, Aidan Hogan wrote:

Hey all,

A Masters student of mine (José Moreno in CC) has been working on a
faceted navigation system for (large-scale) RDF datasets called "GraFa".

The system i

[Wikidata] Generating info-boxes from Wikidata: the importance of values!

2018-03-06 Thread Aidan Hogan

Hi all,

Tomás and I would like to share a paper that might be of interest to the 
community. It presents some preliminary results of a work looking at 
fully automated methods to generate Wikipedia info-boxes from Wikidata. 
The main focus is on deciding what information from Wikidata to include, 
and in what order. The results are based on asking users (students) to 
rate some prototypes of generated info-boxes.


Tomás Sáez, Aidan Hogan "Automatically Generating Wikipedia Infoboxes 
from Wikidata". In the Proceedings of the Wiki Workshop at WWW 2018, 
Lyon, France, April 24, 2018.


- Link: http://aidanhogan.com/docs/infobox-wikidata.pdf

We understand that populating info-boxes is an important goal of 
Wikidata and hence we thought we'd share some lessons learned.


Obviously a lot of work is being put into populating info-boxes from 
Wikidata, but the main methods at the moment seem to be template-based 
and require a lot of manual labour; plus the definition of these 
templates seems to be a difficult problem for classes such as person 
(where different information will have different priorities for people 
of different professions, notoriety, etc.).


We were just interested to see how far we could get with a fully 
automated approach using some generic ranking methods. Also we thought 
that something like this could perhaps be used to generate a "default" 
info-box for articles with no info-box and no associated template 
mapping. The paper presents preliminary results along those lines.


One interesting result is that a major factor in the evaluation of the 
generated info-boxes was the importance of the value. For example, 
Barack Obama has lots of awards, but perhaps only something like the 
Nobel Peace Prize might be of relevance to show in the info-box (<- 
being intended as an illustrative example rather than a concrete 
assertion of course!). Another example is that sibling might not be an 
important attribute in a lot of cases, but when that sibling is Barack 
Obama, then that deserves to be in the info-box (<- how such cases could 
be expressed in a purely template-based approach, we are not sure, but 
it would seem difficult).


We assess the importance of values with PageRank. Assessing the 
importance not only of attributes, but of values, turned out to be a 
major influence on how highly our evaluators assessed the quality of the 
generated info-boxes.
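
For readers less familiar with the technique: by PageRank we mean the 
standard power-iteration computation over the graph of items, so that a 
value such as Barack Obama scores highly because many other important 
entities link to it. A minimal generic sketch (not the code used for the 
paper; it assumes the item graph is given as adjacency lists and that 
every entity appears as a key) might look as follows:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Generic power-iteration PageRank over an item graph given as adjacency
// lists (entity -> entities it links to). Purely illustrative: dangling
// nodes and nodes that only appear as link targets are handled loosely.
public class PageRank {
    public static Map<String, Double> compute(Map<String, List<String>> outLinks,
                                              int iterations, double damping) {
        int n = outLinks.size();
        Map<String, Double> rank = new HashMap<>();
        for (String v : outLinks.keySet()) {
            rank.put(v, 1.0 / n);                                // uniform starting score
        }
        for (int i = 0; i < iterations; i++) {
            Map<String, Double> next = new HashMap<>();
            for (String v : outLinks.keySet()) {
                next.put(v, (1.0 - damping) / n);                // teleportation term
            }
            for (Map.Entry<String, List<String>> e : outLinks.entrySet()) {
                List<String> targets = e.getValue();
                if (targets.isEmpty()) continue;
                double share = damping * rank.get(e.getKey()) / targets.size();
                for (String t : targets) {
                    next.merge(t, share, Double::sum);           // pass rank along each link
                }
            }
            rank = next;
        }
        return rank;
    }
}

Intuitively, this is what lets the importance of values be distinguished 
from the importance of attributes: a value inherits its score from the 
entities that link to it, independently of which property links them.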


This initial/isolated observation might be interesting since, to the 
best of our understanding, the current wisdom on populating info-boxes 
from Wikidata focuses on what attributes to present and in which order, 
but does not consider the importance of values (aside from the Wikidata 
rank feature, which we believe is more intended to assess 
relevance/timeliness, than importance).


Hence one of the most interesting (and surprising, for us at least) 
results of the work is to suggest that it appears to be important to 
rank *values* by importance (not just attributes) when considering what 
information the user might be interested in.


(There are limitations to PageRank measures, however, in that they 
cannot assess, for example, the importance of a particular date, or, 
more generally, datatype values.)


In any case, we are looking forward to presenting these results at the 
Wiki Workshop at WWW 2018, and any feedback or thoughts are welcome!


Cheers,
Aidan

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata [thanks!]

2018-02-08 Thread Aidan Hogan

Hi all,

On behalf of José and myself, we would really like to thank the people 
who tried out our system and gave us feedback!



Some aspects are left to work on (for example, we have not tested for 
mobiles, etc.). However, we have made some minor initial changes 
reflecting some of the comments we received (adding example text for the 
type box, clarifying that the numbers refer to number of results not Q 
codes, etc.):


http://grafa.dcc.uchile.cl/


To summarise some aspects of the work and what we've learnt:

* In terms of usability, the principal lesson we have learnt (amongst 
many) is that it is not clear for users what is a type. For example, 
when searching for "popes born in Poland", the immediate response of 
users is to type "pope" rather than "human" or "person" in the type box. 
In a future version of the system, we might thus put less emphasis on 
starting the search with type (the original reasoning behind this was to 
quickly reduce the number of facets/properties that would be shown). 
Hence the main conclusion here is to try to avoid interfaces that centre 
around "types".



* A major design goal is that the user is only ever shown options that 
lead to at least one result. All facets computed are exact with exact 
numbers. The technical challenge here is displaying these facets with 
exact numbers and values for large result sizes, such as human:


http://grafa.dcc.uchile.cl/search?instance=Q5

This is achieved through caching. We compute all possible queries in the 
data that would yield >50,000 results (e.g., human->gender:male, 
human->gender:male->country:United States, etc.). We then compute their 
facets offline and cache them. In total there's only a couple of hundred 
such queries generating that many results. The facets for other queries 
with fewer than 50,000 results are computed live. Note that we cannot 
cache for keyword queries (instead we just compute facets for the first 
50,000 most relevant results). Also, if we add other features such as 
range queries or sub-type reasoning, the issue of caching would become 
far more complex to handle.
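
To picture the mechanism, the cache can be sketched roughly as follows (a 
simplified illustration, not GraFa's actual code): each query state maps 
to a canonical key, the facet tables for the few hundred large query 
states are filled in by an offline job, and everything else falls back to 
a live computation, represented here by a placeholder method:

import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;

// Rough sketch of a facet cache: query states expected to yield more than
// 50,000 results have their facet tables precomputed offline and stored
// under a canonical key; everything else is computed live. Here a "facet
// table" is simply a map from "P...=Q..." strings to result counts.
public class FacetCache {
    public static final long CACHE_THRESHOLD = 50_000;

    private final Map<String, Map<String, Long>> precomputed = new ConcurrentHashMap<>();

    // Canonical key for a query state, e.g. "Q5|P21=Q6581097|P27=Q30"
    // (the type first, then the selected property=value pairs in sorted order).
    static String key(String typeId, SortedMap<String, String> selections) {
        StringBuilder sb = new StringBuilder(typeId);
        selections.forEach((p, v) -> sb.append('|').append(p).append('=').append(v));
        return sb.toString();
    }

    // Offline job: store the facet table for a query state whose result
    // size exceeds CACHE_THRESHOLD.
    void precompute(String typeId, SortedMap<String, String> selections,
                    Map<String, Long> facets) {
        precomputed.put(key(typeId, selections), facets);
    }

    // Online: serve from the cache if present, otherwise compute live.
    Map<String, Long> facetsFor(String typeId, SortedMap<String, String> selections) {
        Map<String, Long> cached = precomputed.get(key(typeId, selections));
        return cached != null ? cached : computeFacetsLive(typeId, selections);
    }

    // Placeholder for the live facet query against the index.
    Map<String, Long> computeFacetsLive(String typeId, SortedMap<String, String> selections) {
        throw new UnsupportedOperationException("placeholder");
    }

    public static void main(String[] args) {
        SortedMap<String, String> male = new TreeMap<>();
        male.put("P21", "Q6581097");            // selection "sex or gender = male" under type human (Q5)
        System.out.println(FacetCache.key("Q5", male));   // prints Q5|P21=Q6581097
    }
}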



In any case, thanks again to all those who provided feedback! Of course 
further comments or questions are welcome (either on- or off-list). 
Likewise we will be writing up a paper describing technical aspects of 
the system soon with some evaluation results. Once it's ready we will of 
course share a link with you.


Best,
Aidan and José


 Forwarded Message 
Subject: Re: GraFa: Faceted browser for RDF/Wikidata [feedback requested]
Date: Mon, 15 Jan 2018 11:47:18 -0300
From: Aidan Hogan 
To: Discussion list for the Wikidata project. 
CC: José Ignacio . 

Hi all!

Just a friendly reminder that tomorrow we will close the questionnaire 
so if you have a few minutes to help us out (or are just curious to see 
our faceted search system) please see the links and instructions below.


And many thanks to those who have already provided feedback! :)

Best,
José & Aidan

On 09-01-2018 14:18, Aidan Hogan wrote:

Hey all,

A Masters student of mine (José Moreno in CC) has been working on a 
faceted navigation system for (large-scale) RDF datasets called "GraFa".


The system is available here loaded with a recent version of Wikidata:

http://grafa.dcc.uchile.cl/

Hopefully it is more or less self-explanatory for the moment. :)


If you have a moment to spare, we would hugely appreciate it if you 
could interact with the system for a few minutes and then answer a quick 
questionnaire that should only take a couple more minutes:


https://goo.gl/forms/h07qzn0aNGsRB6ny1

Just for the moment while the questionnaire is open, we would kindly 
request to send feedback to us personally (off-list) to not affect 
others' responses. We will leave the questionnaire open for a week until 
January 16th, 17:00 GMT. After that time of course we would be happy to 
discuss anything you might be interested in on the list. :)


After completing the questionnaire, please also feel free to visit or 
list something you noticed on the Issue Tracker:


https://github.com/joseignm/GraFa/issues


Many thanks,
Aidan and José






___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] GraFa: Faceted browser for RDF/Wikidata [feedback requested]

2018-01-15 Thread Aidan Hogan

Hi all!

Just a friendly reminder that tomorrow we will close the questionnaire 
so if you have a few minutes to help us out (or are just curious to see 
our faceted search system) please see the links and instructions below.


And many thanks to those who have already provided feedback! :)

Best,
José & Aidan

On 09-01-2018 14:18, Aidan Hogan wrote:

Hey all,

A Masters student of mine (José Moreno in CC) has been working on a 
faceted navigation system for (large-scale) RDF datasets called "GraFa".


The system is available here loaded with a recent version of Wikidata:

http://grafa.dcc.uchile.cl/

Hopefully it is more or less self-explanatory for the moment. :)


If you have a moment to spare, we would hugely appreciate it if you 
could interact with the system for a few minutes and then answer a quick 
questionnaire that should only take a couple more minutes:


https://goo.gl/forms/h07qzn0aNGsRB6ny1

Just for the moment while the questionnaire is open, we would kindly 
request to send feedback to us personally (off-list) to not affect 
others' responses. We will leave the questionnaire open for a week until 
January 16th, 17:00 GMT. After that time of course we would be happy to 
discuss anything you might be interested in on the list. :)


After completing the questionnaire, please also feel free to visit or 
list something you noticed on the Issue Tracker:


https://github.com/joseignm/GraFa/issues


Many thanks,
Aidan and José






___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] GraFa: Faceted browser for RDF/Wikidata [feedback requested]

2018-01-09 Thread Aidan Hogan

Hey all,

A Masters student of mine (José Moreno in CC) has been working on a 
faceted navigation system for (large-scale) RDF datasets called "GraFa".


The system is available here loaded with a recent version of Wikidata:

http://grafa.dcc.uchile.cl/

Hopefully it is more or less self-explanatory for the moment. :)


If you have a moment to spare, we would hugely appreciate it if you 
could interact with the system for a few minutes and then answer a quick 
questionnaire that should only take a couple more minutes:


https://goo.gl/forms/h07qzn0aNGsRB6ny1

Just for the moment while the questionnaire is open, we would kindly 
request to send feedback to us personally (off-list) to not affect 
others' responses. We will leave the questionnaire open for a week until 
January 16th, 17:00 GMT. After that time of course we would be happy to 
discuss anything you might be interested in on the list. :)


After completing the questionnaire, please also feel free to visit or 
list something you noticed on the Issue Tracker:


https://github.com/joseignm/GraFa/issues


Many thanks,
Aidan and José





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [Bigdata-developers] Listing named graphs in Wikidata service

2017-04-21 Thread Aidan Hogan

Hi Stas,

On 21-04-2017 15:04, Stas Malyshev wrote:

Hi!


You agree with me that this query :
`select distinct ?g { graph ?g {?s ?p ?o} }` seems to be a valid SPARQL
query, but throws an error in WDQS service [1].


It is a valid SPARQL query, but what you're essentially asking is to
download the whole 1.8b triples in a single query. There's no way we can
deliver this in one minute, which is the current constraint on queries,
nor do we want to enable queries like this - they are very
resource-intensive and serve little purpose.

If you want to work with huge data sets or import the whole data, you
may look either into getting the dump download and processing it with
offline tools like Wikidata Toolkit[2], or into using our LDF server[1],
which is lightweight and allows you to work with the data much more efficiently.


While not having a strong opinion on the question of whether or not WDQS 
should support graph queries if it doesn't have named graphs, I disagree 
on the minor technical point that that query requires streaming all 
data. It simply asks for a unique list of all named graphs. Typically 
that will be much smaller than the entire data.


(Of course a naive implementation might stream everything to compute the 
query, but that's another matter I think.)


Cheers,
Aidan

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Student topics of relevance for Wikidata

2017-04-13 Thread Aidan Hogan

This looks great Lydia, thanks!!

The descriptions look like enough for me to catch the idea and explain 
it to a student.


If such a student is interested, we will let you know. :)

Best!
Aidan

On 13-04-2017 12:44, Lydia Pintscher wrote:

On Thu, Apr 13, 2017 at 4:44 PM, Aidan Hogan  wrote:

Hi all,

So at my university the undergraduate students must complete three months of
work towards writing a final short thesis. Generally this work doesn't need to
involve research but should result in a final demonstrable outcome, meaning
a tool, an application, or something like that.

The students are in Computer Science and have taken various relevant courses
including Semantic Web, Big Data, Data Mining, and so forth.

I was wondering if there was, for example, a list of possible topics
internally within Wikidata ... topics that students could work on with some
guidance here and there from a professor. (Not necessarily research-level
topics, but also implementation or prototyping tasks, perhaps even regarding
something more speculative, or "wouldn't it be nice if we could ..." style
topics?)

If there is no such list, perhaps it might be a good idea to start thinking
about one?

The students I talk with are very interested in doing tasks that could have
real-world impact and I think in this setting, working on something relevant
to the deployment of Wikidata would be a really great experience for them
and hopefully also of benefit to Wikidata.

(And probably there are other professors in a similar context looking for
interesting topics to assign students.)


Hi Aidan,

Thanks for reaching out. Such a list exists:
https://phabricator.wikimedia.org/T90870  However it doesn't make a
lot of sense without some guidance. Some of these topics also already
have someone interested in working on them. It is best to have a quick
call with me to discuss it.


Cheers
Lydia



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Student topics of relevance for Wikidata

2017-04-13 Thread Aidan Hogan

Hi all,

So at my university the undergraduate students must complete three months 
of work towards writing a final short thesis. Generally this work doesn't 
need to involve research but should result in a final demonstrable 
outcome, meaning a tool, an application, or something like that.


The students are in Computer Science and have taken various relevant 
courses including Semantic Web, Big Data, Data Mining, and so forth.


I was wondering if there was, for example, a list of possible topics 
internally within Wikidata ... topics that students could work on with 
some guidance here and there from a professor. (Not necessarily 
research-level topics, but also implementation or prototyping tasks, 
perhaps even regarding something more speculative, or "wouldn't it be 
nice if we could ..." style topics?)


If there is no such list, perhaps it might be a good idea to start 
thinking about one?


The students I talk with are very interested in doing tasks that could 
have real-world impact and I think in this setting, working on something 
relevant to the deployment of Wikidata would be a really great 
experience for them and hopefully also of benefit to Wikidata.


(And probably there are other professors in a similar context looking 
for interesting topics to assign students.)


Best,
Aidan

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Use of Wikidata as a source for "mainstream" media

2017-03-03 Thread Aidan Hogan



On 03-03-2017 5:38, Markus Kroetzsch wrote:

Hi Aidan,

I recall a discussion regarding the evaluation of occupations of people
mentioned in the Panama Papers. I think this was discussed (critically)
in Le Monde or Le Monde online.


Thanks! For future reference, I found the article here:

http://www.lemonde.fr/panama-papers/article/2016/05/13/panama-papers-pourquoi-cette-infographie-est-bidon_4918997_4890278.html


And there is the Academy Award page by FAZ online:
http://www.faz.net/aktuell/feuilleton/kino/academy-awards-die-oscar-gewinner-auf-einen-blick-12820119.html


Great!

Cheers,
Aidan



On 02.03.2017 21:56, Aidan Hogan wrote:

Hi all,

Is there a list somewhere of instances where the "mainstream" media has
used Wikidata as a source?

I found two thus far:

[1] BuzzFeed checking if 2016 really was a bad year (relatively
speaking) in terms of celebrity deaths.

[2] Le Monde checking if 27 is really a dangerous year for artists to be
alive.

Pointers to further instances of Wikidata being used as a source in
mainstream media (or indeed in very prominent applications) would be
greatly appreciated!

Thanks,
Aidan


P.S., I also found a link to a Wiki page with lists of press coverage,
but unfortunately it hasn't been updated in recent years [3]. There I
found some articles about Histopedia, etc., in the Guardian.


[1]
https://www.buzzfeed.com/katiehasty/song-ends-melody-lingers-in-2016?utm_term=.dvV2Dk39L#.itoYqoRVd



[2]
http://www.lemonde.fr/les-decodeurs/visuel/2014/06/06/les-artistes-ont-ils-vraiment-plus-de-risque-de-mourir-a-27-ans_4433903_4355770.html



[3] https://www.wikidata.org/wiki/Wikidata:Press_coverage

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Use of Wikidata as a source for "mainstream" media

2017-03-02 Thread Aidan Hogan

Hi all,

Is there a list somewhere of instances where the "mainstream" media has 
used Wikidata as a source?


I found two thus far:

[1] BuzzFeed checking if 2016 really was a bad year (relatively 
speaking) in terms of celebrity deaths.


[2] Le Monde checking if 27 is really a dangerous year for artists to be 
alive.


Pointers to further instances of Wikidata being used as a source in 
mainstream media (or indeed in very prominent applications) would be 
greatly appreciated!


Thanks,
Aidan


P.S., I also found a link to a Wiki page with lists of press coverage, 
but unfortunately it hasn't been updated in recent years [3]. There I 
found some articles about Histopedia, etc., in the Guardian.



[1] 
https://www.buzzfeed.com/katiehasty/song-ends-melody-lingers-in-2016?utm_term=.dvV2Dk39L#.itoYqoRVd


[2] 
http://www.lemonde.fr/les-decodeurs/visuel/2014/06/06/les-artistes-ont-ils-vraiment-plus-de-risque-de-mourir-a-27-ans_4433903_4355770.html


[3] https://www.wikidata.org/wiki/Wikidata:Press_coverage

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] ntriples dump?

2016-08-26 Thread Aidan Hogan

On 26-08-2016 16:58, Stas Malyshev wrote:

Hi!


I think in terms of the dump, /replacing/ the Turtle dump with the
N-Triples dump would be a good option. (Not sure if that's what you were
suggesting?)


No, I'm suggesting having both. Turtle is easier to comprehend and also
more compact for download, etc. (though I didn't check how much is the
difference - compressed it may not be that big).


I would argue that human readability is not so important for a dump? For 
dereferenced documents sure, but less so for a dump perhaps.


Also I'd expect when [G|B]Zipped the difference would not justify having 
both (my guess is the N-triples file compressed should end up within 
+25% of the size of the Turtle file compressed, but that's purely a 
guess; obviously worth trying it to see!).


But yep, I get both points.


to have both: existing tools expecting Turtle shouldn't have a problem
with N-Triples.


That depends on whether these tools actually understand RDF - some might
be more simplistic (with text-based formats, you can achieve a lot even
with dumber tools). But that definitely might be an option too. I'm not
sure if it's the best one but a possibility, so we may discuss it too.


I'd imagine that anyone processing Turtle would be using a full-fledged 
Turtle parser? A dumb tool would have to be pretty smart to do anything 
useful with the Turtle I think. And it would not seem wise to parse the 
precise syntax of Turtle that way. But you never know [1]. :)


Of course if providing both is easy, then there's no reason not to 
provide both.



(Also just to put the idea out there of perhaps (also) having N-Quads
where the fourth element indicates the document from which the RDF graph
can be dereferenced. This can be useful for a tool that, e.g., just


What do you mean by "document" - like an entity? That may be a problem since
some data - like references and values, or property definitions - can be
used by more than one entity. So it's not that trivial to extract all
data regarding one entity from the dump. You can do it via export, e.g.:
http://www.wikidata.org/entity/Q42?flavor=full - but that doesn't
extract it, it just generates it.


If it's problematic, then for sure it can be skipped as a feature. I'm 
mainly just floating the idea.


Perhaps to motivate the feature briefly: we worked a lot for a while on 
a search engine over RDF data ingested from the open Web. Since we were 
ingesting data from the Web, considering one giant RDF graph was not a 
possibility: we needed to keep track of which RDF triples came from 
which Web documents for a variety of reasons. This simple notion of 
provenance was easy to keep track of when we crawled the individual 
documents themselves because we knew what documents we were taking 
triples from. But we could rarely if ever use dumps because they did not 
give such information.


In this view, Wikidata is a website publishing RDF like any other.

It is useful in such applications to know the online RDF documents in 
which a triple can be found. The document could be the entity, or it 
could be a physical location like:


http://www.wikidata.org/entity/Q13794921.ttl

Mainly it needs to be an IRI that can be resolved by HTTP to a document 
containing the triple. Ideally the quads would also cover all triples in 
that document. Even more ideally, the dumps would somehow cover all the 
information that could be obtained from crawling the RDF documents on 
Wikidata, including all HTTP redirects, and so forth.
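
To make the idea concrete, a single (purely illustrative) N-Quads line 
might look as follows, with the fourth element naming the document from 
which the triple can be dereferenced; the ".ttl" document IRI here is 
just an assumption for the sake of the example:

<http://www.wikidata.org/entity/Q42> <http://www.wikidata.org/prop/direct/P31> <http://www.wikidata.org/entity/Q5> <http://www.wikidata.org/entity/Q42.ttl> .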


At the same time I understand this is not a priority and there's 
probably no immediate need for N-Quads or publishing redirects. The need 
for this is rather abstract at the moment, so it is perhaps best left 
until the need becomes more concrete.



tl;dr:
N-Triples or N-Triples + Turtle sounds good.
N-Quads would be a bonus if easy to do.

Best,
Aidan

[1] 
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] ntriples dump?

2016-08-26 Thread Aidan Hogan

Hi Stas,

I think in terms of the dump, /replacing/ the Turtle dump with the 
N-Triples dump would be a good option. (Not sure if that's what you were 
suggesting?)


As you already mentioned, N-Triples is easier to process with typical 
unix command-line tools and scripts, etc. But also any (RDF 1.1) 
N-Triples file should be valid Turtle, so I don't see a convincing need 
to have both: existing tools expecting Turtle shouldn't have a problem 
with N-Triples.
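
Just to make that concrete with a small sketch (in Python; the dump file 
name is a placeholder, and I'm assuming sitelinks appear as schema:about 
triples as in the current RDF exports), extracting all enwiki sitelinks 
would then need nothing more than line matching:

# Sketch: count enwiki sitelinks in a gzipped N-Triples dump using
# plain line matching, with no RDF-aware tooling required.
import gzip

DUMP = "wikidata.nt.gz"  # placeholder file name

count = 0
with gzip.open(DUMP, "rt", encoding="utf-8") as f:
    for line in f:
        # each N-Triples line is one complete triple: <s> <p> <o> .
        if "<http://schema.org/about>" in line and "//en.wikipedia.org/" in line:
            count += 1
print("enwiki sitelinks:", count)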


(Also just to put the idea out there of perhaps (also) having N-Quads 
where the fourth element indicates the document from which the RDF graph 
can be dereferenced. This can be useful for a tool that, e.g., just 
wants to quickly refresh a single graph from the dump, or more generally 
that wants to keep track of a simple and quick notion of provenance: 
"this triple was found in this Web document".)


Cheers,
Aidan

On 26-08-2016 16:30, Stas Malyshev wrote:

Hi!

I was thinking recently about various data processing scenarios in
wikidata and there's one case we don't have a good coverage for I think.

TLDR: One of the things I think we might do to make it easier to work
with data is having ntriples (line-based) RDF dump format available.

If you need to process a lot of data (like all enwiki sitelinks, etc.)
then the Query Service is not very efficient there, due to limits and
sheer volume of data. We could increase limits but not by much - I don't
think we can allow a 30-minute processing task to hog the resources of
the service to itself. We have some ways to mitigate this, in theory,
but in practice they'll take time to be implemented and deployed.

The other approach would be to do dump processing. Which would work in
most scenarios but the problem is that we have two forms of dump right
now - JSON and TTL (Turtle) and both are not easy to process without
tools with deep understanding of the formats. For JSON, we have Wikidata
Toolkit but it can't ingest RDF/Turtle, and also has some entry barrier
to get everything running even when operation that needs to be done is
trivial.

So I was thinking - what if we had also ntriples RDF dump? The
difference between ntriples and Turtle is that ntriples is line-based
and fully expanded - which means every line can be understood on its own
without needing any context. This enables to process the dump using the
most basic text processing tools or any software that can read a line of
text and apply regexp to it. The downside of ntriples is it's really
verbose, but compression will take care of most of it, and storing
another 10-15G or so should not be a huge deal. Also, current code
already knows how to generate ntriples dump (in fact, almost all unit
tests internally use this format) - we just need to create a job that
actually generates it.

Of course, with right tools you can generate ntriples dump from both
Turtle one and JSON one (Wikidata toolkit can do the latter, IIRC) but
it's one more moving part which makes it harder and introduces potential
for inconsistencies and surprises.

So, what do you think - would having ntriples RDF dump for wikidata help
things?



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata query performance paper

2016-08-07 Thread Aidan Hogan
, and the other courses in 6 other languages, and CC
Yale OYC, as well as CC WUaS subjects, and with planning for major
universities with these and growing number of wiki subjects in all
languages).

I have no idea yet how to write the SQL/SPARQL for this, but rankable Q*
identifiers, new Q* identifiers and Google would be places I'd begin if
I did. What do you think?

Cheers, Scott



On Sun, Aug 7, 2016 at 2:02 PM, Aidan Hogan wrote:

Hey Scott,

On 07-08-2016 16:15, Info WorldUniversity wrote:

Hi Aidan, Markus, Daniel and Wikidatans,

As an emergence out of this conversation on Wikidata query
performance,
and re cc World University and School/Wikidata, as a theoretical
challenge, how would you suggest coding WUaS/Wikidata initially
to be
able to answer this question - "What are most impt stats issues in
earth/space sci that journalists should understand?" -
https://twitter.com/ReginaNuzzo/status/761179359101259776 - in
many
Wikipedia languages including however in American Sign Language (and
other sign languages), as well as eventually in voice. (Regina
Nuzzo is
an associate Professor at Gallaudet University for the hearing
impaired/deafness, and has a Ph.D. in statistics from Stanford;
Regina
was born with hearing loss herself).


I fear we are nowhere near answering these sorts of questions (by
we, I mean the computer science community, not just Wikidata). The
main problem is that the question is inherently
ill-defined/subjective: there is no correct answer here.

We would need to think about refining the question to something that
is well-defined/objective, which even as a human is difficult.
Perhaps we could consider a question such as: "what statistical
methods (from a fixed list) have been used in scientific papers
referenced by news articles that have been published in the past seven
years by media companies that have their headquarters in the US?".
Of course even then, there are still some minor subjective aspects,
and Wikidata would not have the coverage to answer such a question.

The short answer is that machines are nowhere near answering these
sorts of questions, no more than we are anywhere near taking a raw
stream of binary data from an .mp4 video file and turning it into
visual output. If we want to use machines to do useful things, we
need to meet machines half-way. Part of that is formulating our
questions in a way that machines can hope to process.

I'm excited for when we can ask WUaS (or Wikipedia) this
question, (or
so many others) in voice combining, for example, CC WUaS Statistics,
Earth, Space & Journalism wiki subject pages (with all their CC
MIT OCW
and Yale OYC) - http://worlduniversity.wikia.com/wiki/Subjects - in all
of Wikipedia's 358 languages, again eventually in voice and in
ASL/other
sign languages
(https://twitter.com/WorldUnivAndSch/status/761593842202050560 - see,
too -
https://www.wikidata.org/wiki/Wikidata:Project_chat#Schools).

Thanks for your paper, Aidan, as well. Would designing for deafness
inform how you would approach "Querying Wikidata: Comparing SPARQL,
Relational and Graph Databases" in any new ways?


In the context of Wikidata, the question of language is mostly a
question of interface (which is itself non-trivial). But to answer
the question in whatever language or mode, the question first has to
be answered in some (machine-friendly) language. This is the
direction in which Wikidata goes: answers are first Q* identifiers,
for which labels in different languages can be generated and used to
generate output in a given mode.

Likewise our work is on the level of generating those Q*
identifiers, which can be later turned into tables, maps, sentences,
bubbles, etc. I think the interface question is an important one,
but a different one to that which we tackle.

Cheers,
Aidan


On Sat, Aug 6, 2016 at 12:29 PM, Markus Kroetzsch wrote:

Hi Aidan,

Thanks, very interesting, though I have not read the details
yet.

I wonder if you have compared the actual query results you
got from
the d

Re: [Wikidata] Wikidata query performance paper

2016-08-07 Thread Aidan Hogan

Hey Scott,

On 07-08-2016 16:15, Info WorldUniversity wrote:

Hi Aidan, Markus, Daniel and Wikidatans,

As an emergence out of this conversation on Wikidata query performance,
and re cc World University and School/Wikidata, as a theoretical
challenge, how would you suggest coding WUaS/Wikidata initially to be
able to answer this question - "What are most impt stats issues in
earth/space sci that journalists should understand?" -
https://twitter.com/ReginaNuzzo/status/761179359101259776 - in many
Wikipedia languages including however in American Sign Language (and
other sign languages), as well as eventually in voice. (Regina Nuzzo is
an associate Professor at Gallaudet University for the hearing
impaired/deafness, and has a Ph.D. in statistics from Stanford; Regina
was born with hearing loss herself).


I fear we are nowhere near answering these sorts of questions (by we, I 
mean the computer science community, not just Wikidata). The main 
problem is that the question is inherently ill-defined/subjective: there 
is no correct answer here.


We would need to think about refining the question to something that is 
well-defined/objective, which even as a human is difficult. Perhaps we 
could consider a question such as: "what statistical methods (from a 
fixed list) have been used in scientific papers referenced by news 
articles that have been published in the past seven years by media 
companies that have their headquarters in the US?". Of course even then, 
there are still some minor subjective aspects, and Wikidata would not 
have the coverage to answer such a question.


The short answer is that machines are nowhere near answering these sorts 
of questions, no more than we are anywhere near taking a raw stream of 
binary data from an .mp4 video file and turning it into visual output. 
If we want to use machines to do useful things, we need to meet machines 
half-way. Part of that is formulating our questions in a way that 
machines can hope to process.
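
(Just to illustrate what "meeting machines half-way" can look like, here 
is a small sketch for one well-defined fragment of such a question, 
namely entities headquartered in the US, written against the public 
query service; the property/item IDs I'm assuming here are P159 
(headquarters location), P17 (country) and Q30 (United States), and the 
LIMIT is only there to keep the query cheap.)

# Sketch: query WDQS for entities whose headquarters are in the US.
import requests

QUERY = """
SELECT ?org ?orgLabel WHERE {
  ?org wdt:P159 ?hq .
  ?hq wdt:P17 wd:Q30 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
LIMIT 10
"""

resp = requests.get("https://query.wikidata.org/sparql",
                    params={"query": QUERY},
                    headers={"Accept": "application/sparql-results+json"})
resp.raise_for_status()
for row in resp.json()["results"]["bindings"]:
    print(row["org"]["value"], row.get("orgLabel", {}).get("value", ""))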



I'm excited for when we can ask WUaS (or Wikipedia) this question, (or
so many others) in voice combining, for example, CC WUaS Statistics,
Earth, Space & Journalism wiki subject pages (with all their CC MIT OCW
and Yale OYC) - http://worlduniversity.wikia.com/wiki/Subjects - in all
of Wikipedia's 358 languages, again eventually in voice and in ASL/other
sign languages
(https://twitter.com/WorldUnivAndSch/status/761593842202050560 - see,
too - https://www.wikidata.org/wiki/Wikidata:Project_chat#Schools).

Thanks for your paper, Aidan, as well. Would designing for deafness
inform how you would approach "Querying Wikidata: Comparing SPARQL,
Relational and Graph Databases" in any new ways?


In the context of Wikidata, the question of language is mostly a 
question of interface (which is itself non-trivial). But to answer the 
question in whatever language or mode, the question first has to be 
answered in some (machine-friendly) language. This is the direction in 
which Wikidata goes: answers are first Q* identifiers, for which labels 
in different languages can be generated and used to generate output in a given mode.


Likewise our work is on the level of generating those Q* identifiers, 
which can be later turned into tables, maps, sentences, bubbles, etc. I 
think the interface question is an important one, but a different one to 
that which we tackle.


Cheers,
Aidan



On Sat, Aug 6, 2016 at 12:29 PM, Markus Kroetzsch wrote:

Hi Aidan,

Thanks, very interesting, though I have not read the details yet.

I wonder if you have compared the actual query results you got from
the different stores. As far as I know, Neo4J actually uses a very
idiosyncratic query semantics that is neither compatible with SPARQL
(not even on the BGP level) nor with SQL (even for
SELECT-PROJECT-JOIN queries). So it is difficult to compare it to
engines that use SQL or SPARQL (or any other standard query
language, for that matter). In this sense, it may not be meaningful
to benchmark it against such systems.

Regarding Virtuoso, the reason for not picking it for Wikidata was
the lack of load-balancing support in the open source version, not
the performance of a single instance.

Best regards,

Markus



On 06.08.2016 18:19, Aidan Hogan wrote:

Hey all,

Recently we wrote a paper discussing the query performance for
Wikidata,
comparing different possible representations of the
knowledge-base in
Postgres (a relational database), Neo4J (a graph database),
Virtuoso (a
SPARQL database) and BlazeGraph (the SPARQL database currently
in use)
for a set of equivalent benchmark queries.

The paper was recently accepted for presentation at the
International
Semantic Web Conference (ISWC) 2016. A pre-print is available here:

http://aidanh

Re: [Wikidata] Wikidata query performance paper

2016-08-07 Thread Aidan Hogan

Hey Daniel,

On 07-08-2016 7:03, Daniel Kinzler wrote:

Hi Aidan!

Thank you for this very interesting research!

Query performance was of course on of the key factors for selecting the
technology to use for the query services. However, it was only one among several
more. The Wikidata use case is different from most common scenarios in some
ways, for instance:

* We cannot optimize for specific queries, since users are free to submit any
query they like.
* The data representation needs to be intuitive enough for (technically inclined)
casual users to grasp and write queries.
* The data doesn't hold still; it needs to be updated continuously, multiple
times per second.
* Our data types are more complex than usual - for instance, we support multiple
calendar models for dates, and not only values but also different accuracies up
to billions of years; we use "quantities" with unit and uncertainty instead of
plain numbers, etc.

My point is that, if we had a static data set and a handful of known queries to
optimize for, we could have set up a relational or graph database that would be
far more performant than what we have now. The big advantage of Blazegraph is
its flexibility, not raw performance.


Understood. :) Taking everything into account as mentioned above, and 
based on our own experiences with various experiments in the context of 
Wikidata and other works, I think the choice to use RDF/SPARQL was the 
right one (though I would be biased on this issue since I've worked in 
the area for a long time). I guess the more difficult question then, is, 
which RDF/SPARQL implementation to choose (since any such implementation 
should cover as least points 1, 2 and 4 in a similar way), which in turn 
reduces down to the distinguishing questions of performance, licensing, 
distribution, maturity, tech support, development community, and 
non-standard features (keyword search), etc.


Based on raw query performance, based personally on what I have seen, I 
think Virtuoso probably has the lead at the moment in that it has 
consistently outperformed other SPARQL engines, not only in our Wikidata 
experiments, but in other benchmarks by other authors. However, taking 
all the other points into account, particularly in terms of licensing, 
Blazegraph does seem to have been a sound choice. And the current query 
service does seem to be a sound base to work forward from.



It might be interesting to you to know that we initially started to implement
the query service against a graph database, Titan - which was discontinued while
we were still getting up to speed. Luckily this happened early on, it would have
been quite painful to switch after we had gone live.


This is indeed good to know! (We considered other graph database 
engines, but we did not think Gremlin was a good fit with what Wikidata 
was trying to achieve, in the sense of being too "imperative": though one 
can indeed express something like basic graph patterns (BGPs) with the language, it's not 
particularly easy, nor intuitive.)


Cheers,
Aidan



Am 06.08.2016 um 18:19 schrieb Aidan Hogan:

Hey all,

Recently we wrote a paper discussing the query performance for Wikidata,
comparing different possible representations of the knowledge-base in Postgres
(a relational database), Neo4J (a graph database), Virtuoso (a SPARQL database)
and BlazeGraph (the SPARQL database currently in use) for a set of equivalent
benchmark queries.

The paper was recently accepted for presentation at the International Semantic
Web Conference (ISWC) 2016. A pre-print is available here:

http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf

Of course there are some caveats with these results in the sense that perhaps
other engines would perform better on different hardware, or different styles of
queries: for this reason we tried to use the most general types of queries
possible and tried to test different representations in different engines (we
did not vary the hardware). Also in the discussion of results, we tried to give
a more general explanation of the trends, highlighting some strengths/weaknesses
for each engine independently of the particular queries/data.

I think it's worth a glance for anyone who is interested in the
technology/techniques needed to query Wikidata.

Cheers,
Aidan


P.S., the paper above is a follow-up to a previous work with Markus Krötzsch
that focussed purely on RDF/SPARQL:

http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf

(I'm not sure if it was previously mentioned on the list.)

P.P.S., as someone who's somewhat of an outsider but who's been watching on for
a few years now, I'd like to congratulate the community for making Wikidata what
it is today. It's awesome work. Keep going. :)

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata





___

Re: [Wikidata] Wikidata query performance paper

2016-08-06 Thread Aidan Hogan



On 06-08-2016 18:48, Stas Malyshev wrote:

Hi!


There's a brief summary in the paper of the models used. In terms of all
the "gory" details of how everything was generated, (hopefully) all of
the relevant details supporting the paper should be available here:

http://users.dcc.uchile.cl/~dhernand/wquery/


Yes, the gory part is what I'm after :) Thank you, I'll read through it
in the next couple of days and come back with any questions/comments I
might have.


Okay! :)


We just generalised sitelinks and references as a special type of
qualifier (actually I don't think the paper mentions sitelinks but we
mention this in the context of references).


Sitelinks can not be qualifiers, since they belong to the entity, not to
the statement. They can, I imagine, be considered a special case of
properties (we do not do it, but in theory it is not impossible to
represent them this way if one wanted to).


Ah yes, I think in that context our results should be considered as 
being computed over a "core" of Wikidata in the sense that we did not 
directly consider somevalue, novalue, ranks, etc. (I'm not certain in 
the case of sitelinks; I do not remember discussing those). This is 
indeed all doable in RDF without too much bother (I think) but would be 
much more involved for the relational database or for Neo4J.



I am not sure how exactly one would make references special case of
qualifier, as qualifier has one (maybe complex) value, while references
each can have multiple properties and values, but I'll read through the
details and the code before I talk more about it, it's possible that I
find my answers there.


I guess that depends on what you mean by "automatic" or "manual". :)

Automatic scripts were manually coded to convert from the JSON dump to
each representation. The code is linked above.


Here I meant queries, not data.


Ah, so the query generation process is also described in the 
documentation above. The core idea was to first create "subgraphs" of 
data with the patterns we wanted to generate queries for, and then using 
a certain random process, turn some constants into variables, and then 
select some variables to project. In summary, the queries were 
automatically generated from the data to ensure non-empty results.
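
For anyone curious, a minimal sketch of that idea might look as follows 
(this is not the actual benchmark code, which is linked in the 
documentation above; the two-triple subgraph at the end is just an 
example, with prefixes abbreviated for readability):

# Sketch: generate a SPARQL query from a concrete subgraph by turning a
# random subset of constants into variables and projecting some of them,
# so that results are non-empty by construction.
import random

def subgraph_to_query(triples, var_ratio=0.5, seed=None):
    rng = random.Random(seed)
    constants = sorted({term for triple in triples for term in triple})
    mapping = {}
    for c in constants:
        if rng.random() < var_ratio:
            mapping[c] = "?v%d" % len(mapping)
    patterns = ["  %s %s %s ." % tuple(mapping.get(t, t) for t in triple)
                for triple in triples]
    variables = list(mapping.values())
    projected = (rng.sample(variables, max(1, len(variables) // 2))
                 if variables else ["*"])
    return "SELECT DISTINCT %s WHERE {\n%s\n}" % (
        " ".join(projected), "\n".join(patterns))

print(subgraph_to_query([("wd:Q42", "wdt:P31", "wd:Q5"),
                         ("wd:Q42", "wdt:P19", "wd:Q350")], seed=1))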



I'm not sure I follow on this part, in particular on the part of
"semantic completeness" and why this is hard to achieve in the context
of relational databases. (I get the gist but don't understand enough to
respond directly ... but perhaps below I can answer indirectly?)


Check out https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format

This is the range of data we need to represent and allow people to
query. We found it hard to do this using relational model. It's probably
possible in theory, but producing efficient queries for it looked very
challenging, unless we were essentially to duplicate the effort that is
implemented in any graph database and only use the db itself for most
basic storage needs. That's pretty much what Titan + Cassandra combo
did, which we initially used until Titan's devs were acquired by
DataStax and resulting uncertainty prompted us to look into different
solutions. I imagine in theory it's also possible to create
Something+PostgreSQL combo doing the same, but PostgreSQL looks not enough.


Yes, this is something we did look into in some detail in the sense that 
we had a rather complex relational structure encoding all the features 
mentioned (storing everything from the JSON dumps, essentially), but the 
structure was so complex [1], we decided to simplify and consider the 
final models described in the paper ... especially given the prospect of 
trying to do something similar in Neo4J afterwards. :)



In any case, dealing with things like property paths seem to be rather
hard on SQL-based platform, and practically a must for Wikidata querying.


Yep, agreed.


* Encoding object values with different datatypes (booleans, dates,
etc.) was a pain. One option was to have separate tables/columns for
each datatype, which would complicate queries and also leave the
question of how to add calendars, precisions, etc. Another option was to
use JSON strings to encode the values (the version of Postgres we used
just considered these as strings, but I think the new version has some
JSONB(?) support that could help get around this).


That is also an issue. We have a number of specialty data types (e.g.
dates extending billion years into the future/past, coordinates
including different globes, etc.) which may present a challenge unless
the platform offers an easy way to encode custom types and deal with
them. RDF has rather flexible model (basically string + type IRI) here,
and Blazegraph too, not sure how accommodating the SQL databases would be.

Also, relational DBs mostly prefer very predictable data type model -
i.e. same column always contains the same type. This is obviously not
true for any generic representation, and may be not true even

Re: [Wikidata] Wikidata query performance paper

2016-08-06 Thread Aidan Hogan

On 06-08-2016 17:56, Stas Malyshev wrote:

Hi!


On a side note, the results we presented for BlazeGraph could improve
dramatically if one could isolate queries that timed out. Once one query
in a sequence timed-out (we used server-side timeouts), we observed that
a run of queries would then timeout, possibly a locking problem or


Could you please give a bit more details about this failure scenario? Is
is that several queries are run in parallel and one query, timing out,
hurts performance of others? Does it happen even after the long query
times out? Or it was a sequential run and after one query timed out, the
next query had worse performance than the same query when run not
preceded by the timing-out query, i.e. timeout query had persistent
effect beyond its initial run?


The latter was the case, yes. We ran the queries in a given batch 
sequentially (waiting for one to finish before the next was run) and 
when one timed out, the next would almost surely time-out and the engine 
would not recover.


We tried a few things on this, like waiting an extra 60 seconds before 
running the next query, and also changing memory configurations to avoid 
GC issues. I believe Daniel was also in contact with the devs. 
Ultimately we figured we probably couldn't resolve the issue without 
touching the source code, which would obviously not be fair.



BTW, what was the timeout setting in your experiments? I see in the
article that it says "timeouts are counted as 60 seconds" - does it mean
that Blazegraph had internal timeout setting set to 60 seconds, or that
the setting was different, but when processing results, the actual run
time was replaced by 60 seconds?


Yup, the settings are here:

http://users.dcc.uchile.cl/~dhernand/wquery/#configure-blazegraph

My understanding is that with those settings, we set an internal timeout 
on BlazeGraph of 60 seconds.



Also, did you use analytic mode for the queries?
https://wiki.blazegraph.com/wiki/index.php/QueryEvaluation#Analytic_Query_Evaluation
https://wiki.blazegraph.com/wiki/index.php/AnalyticQuery

This is the mode that is turned on automatically for the Wikidata Query
Service, and it uses AFAIK different memory management which may
influence how the cases you had problems with are handled.


This I am not aware of. I would have to ask Daniel to be sure (I know he 
spent quite a lot of time playing around with different settings in the 
case of BlazeGraph).



I would appreciate as much detail as you could give on this, as this may
also be useful on current query engine work. Also, if you're interested
in the work done on WDQS, our experiences and the reasons for certain
decisions and setups we did, I'd be glad to answer any questions.


I guess to start with you should have a look at the documentation here:

http://users.dcc.uchile.cl/~dhernand/wquery/

If there's some details missing from that, or if you have any further 
questions, I can put you in contact with Daniel who did all the scripts, 
ran the experiments, was in discussion with the devs, etc. in the 
context of BlazeGraph. (I don't think he's on this list.)


I could also ask him perhaps to try create a minimal-ish test-case that 
reproduces the problem.



resource leak. Also Daniel mentioned that from discussion with the devs,
they claim that the current implementation works best on SSD hard
drives; our experiments were on a standard SATA.


Yes, we run it on SSD, judging from our tests on test servers, running
on virtualized SATA machines, the difference is indeed dramatic (orders
of magnitude and more for some queries). Then again, this is highly
unscientific anecdotal evidence, we didn't make anything resembling
formal benchmarks since the test hardware is clearly inferior to the
production one and is meant to be so. But the point is that SSD is
likely a must for Blazegraph to work well on this data set. Might also
improve results for other engines, so not sure how it influences the
comparison between the engines.


Yes, I think this was the message we got from the mailing lists when we 
were trying to troubleshoot these issues: it would be better to use an 
SSD. But we did not have one, and of course we didn't want to tailor our 
hardware to suit one particular engine.


Unfortunately I think all such empirical experiments are in some sense 
anecdotal; even ours. We cannot deduce, for example, what would happen, 
relatively speaking, on a machine with an SSD, or more cores, or with 
multiple instances. But still, one can learn a lot from good anecdotes.


Cheers,
Aidan

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata query performance paper

2016-08-06 Thread Aidan Hogan

Hi Stas,

[I'm sorry, I just realised this email was mysteriously sent before it 
was finished. I'll respond in a moment to your other mail.]


On 06-08-2016 17:38, Stas Malyshev wrote:

Hi!


The paper was recently accepted for presentation at the International
Semantic Web Conference (ISWC) 2016. A pre-print is available here:

http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf


Thank you for the link!
It would be interesting to see actual data representations used for RDF
(e.g. examples of the data or more detailed description). I notice that
they differ substantially from what we use in the Wikidata Query service
implementation, used with Blazegraph, and also some of the performance
features we have implemented are probably not part of your
implementation. In any case, it would be interesting to know the details
of which RDF representations were used.


There's a brief summary in the paper of the models used. In terms of all 
the "gory" details of how everything was generated, (hopefully) all of 
the relevant details supporting the paper should be available here:


http://users.dcc.uchile.cl/~dhernand/wquery/

The RDF representations are summarised in Figure 2. The code we used to 
generate those representations is mentioned here:


http://users.dcc.uchile.cl/~dhernand/wquery/#download-the-code
http://users.dcc.uchile.cl/~dhernand/wquery/#translate-the-data-to-rdf

Note we did not consider any "direct triples" in the representations 
since we felt this would effectively be "covered" by the Named Graphs 
representation. Rather than mixing direct triples and reified 
representations (like in the current service), we chose to keep them 
separate.



I also note that only statements and qualifiers are mentioned in most of
the text, but very little mention of sitelinks and references. Were they
part of the model too?


We just generalised sitelinks and references as a special type of 
qualifier (actually I don't think the paper mentions sitelinks but we 
mention this in the context of references).



Due to the different RDF semantics, it would be also interesting to get
more details about how the example queries were translated to the RDF
representation(s) used in the article. Was it an automatic process or
they were translated manually? Is it possible to see them?


I guess that depends on what you mean by "automatic" or "manual". :)

Automatic scripts were manually coded to convert from the JSON dump to 
each representation. The code is linked above.


We didn't put the dataset up (since the raw data and the code are 
provided and can be used to generate them and the RDF datasets are 
obviously large) but if you want a copy of the raw RDF data we 
generated, let me know.



When working on Query Service implementation, we considered a number of
possible representations, which regard to both performance and semantic
completeness. One of the conclusions was that achieving adequate
semantic completeness and performance on relational database, while
allowing people to (relatively) easy write complex queries is not
possible, due to relational engines not being a good match for
hierachical graph-like structures in Wikidata.


I'm not sure I follow on this part, in particular on the part of 
"semantic completeness" and why this is hard to achieve in the context 
of relational databases. (I get the gist but don't understand enough to 
respond directly ... but perhaps below I can answer indirectly?)



It would be interesting to look at the Postgres implementation of the
data model and queries to see whether your conclusions were different in
this case.


A sketch of the relational schema is given in Figure 3 of the paper 
(which is not too dissimilar to the Named Graph representation for RDF) 
and some more low level details, including code, etc., in the link 
above, including details on indexing. This was something we admittedly 
has to play around with quite a bit.



Our general experiences of using Postgres were:

* It's very good for simple queries that involve a single join through a 
primary/foreign key (a caveat here: we used the "direct client" of 
Postgres since we could not find an HTTP client like those of the other engines).


* It's not so good when there's a lot of "self-joins" in the query 
(compared with Virtuoso), like for "bushy queries" (or what we call 
"snowflake queries"), or when multiple values for a tuple are given 
(i.e., a single pattern contains multiple constants) but none of them 
on its own is particularly selective. We figure that perhaps Virtuoso 
has special optimisations for such self-joins since they would be much 
more common in an RDF/SPARQL scenario than a relational/SQL scenario.


* Encoding object values with different datatypes (booleans, dates, 
etc.) was a pain. One option was to have separate tables/columns for 
each datatype, which would complicate queries and also leave the 
question of how to add calendars, precisions, etc. Another option was to 
use JSON strings 

Re: [Wikidata] Wikidata query performance paper

2016-08-06 Thread Aidan Hogan

Hey Markus,

On 06-08-2016 15:29, Markus Kroetzsch wrote:

Hi Aidan,

Thanks, very interesting, though I have not read the details yet.

I wonder if you have compared the actual query results you got from the
different stores. As far as I know, Neo4J actually uses a very
idiosyncratic query semantics that is neither compatible with SPARQL
(not even on the BGP level) nor with SQL (even for SELECT-PROJECT-JOIN
queries). So it is difficult to compare it to engines that use SQL or
SPARQL (or any other standard query language, for that matter). In this
sense, it may not be meaningful to benchmark it against such systems.


Yes, SPARQL has a homomorphism-based semantics (where a single result 
can repeat an edge or node an arbitrary number of times without problem) 
whereas I believe that Neo4J has a sort of 
pseudo-isomorphism-no-repeated-edge semantics in its evaluation (where a 
result cannot reuse the same edge twice, but can match the same node to 
multiple variables). Our queries were generated in such a way that no 
edges would be repeated. We also applied a distinct (set) semantics in 
all cases. For queries that repeat edges, indeed there would be a problem.
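
(To give a small, hypothetical example of the difference: over a graph 
containing the single triple wd:Q42 wdt:P31 wd:Q5, the pattern 
{ ?x wdt:P31 ?c . ?y wdt:P31 ?c } has one solution under SPARQL's 
semantics, with ?x and ?y both mapped to wd:Q42 and the same edge 
matched twice; under a no-repeated-edge semantics that solution is 
excluded, since the two triple patterns cannot both match the one edge.)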


In terms of checking answers, we cross-referenced the number of results 
returned in each case. Where there were no errors (exceptions or 
timeouts), the result sizes overall were verified to be almost the same 
(something like 99.99%). The small differences were caused by things 
like BlazeGraph rejecting dates like February 30th that other engines 
didn't. We accepted this as close enough, as it was not going to affect the 
performance results.



Our results and experiences were, in general, quite negative with 
respect to using Neo4J at the moment. This was somewhat counter to our 
initial expectations in that we thought that Wikidata would fit 
naturally with the property graph model that Neo4J uses, and also more 
generally in terms of the relative popularity of Neo4J [1].


We encountered a lot of issues, not only in terms of performance, but 
also in terms of indexing and representation (limited support for 
lookups on edge information), query language features (no RPQs: only 
star on simple labels), query planning (poor selectively decisions when 
processing bgps), etc. Our general impression is that Neo4J started with 
a specific use-case in mind (traversing nodes following paths) for which 
it is specialised, but does not currently work well for general basic 
graph pattern matching, and hence does not match well with the Wikidata 
use-case.



Regarding Virtuoso, the reason for not picking it for Wikidata was the
lack of load-balancing support in the open source version, not the
performance of a single instance.


This is good to know! We were admittedly curious about this.

On a side note, the results we presented for BlazeGraph could improve 
dramatically if one could isolate queries that timed out. Once one query 
in a sequence timed-out (we used server-side timeouts), we observed that 
a run of queries would then timeout, possibly a locking problem or 
resource leak. Also Daniel mentioned that from discussion with the devs, 
they claim that the current implementation works best on SSD hard 
drives; our experiments were on a standard SATA.


Best,
Aidan

[1] http://db-engines.com/en/ranking (anecdotal of course)



On 06.08.2016 18:19, Aidan Hogan wrote:

Hey all,

Recently we wrote a paper discussing the query performance for Wikidata,
comparing different possible representations of the knowledge-base in
Postgres (a relational database), Neo4J (a graph database), Virtuoso (a
SPARQL database) and BlazeGraph (the SPARQL database currently in use)
for a set of equivalent benchmark queries.

The paper was recently accepted for presentation at the International
Semantic Web Conference (ISWC) 2016. A pre-print is available here:

http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf

Of course there are some caveats with these results in the sense that
perhaps other engines would perform better on different hardware, or
different styles of queries: for this reason we tried to use the most
general types of queries possible and tried to test different
representations in different engines (we did not vary the hardware).
Also in the discussion of results, we tried to give a more general
explanation of the trends, highlighting some strengths/weaknesses for
each engine independently of the particular queries/data.

I think it's worth a glance for anyone who is interested in the
technology/techniques needed to query Wikidata.

Cheers,
Aidan


P.S., the paper above is a follow-up to a previous work with Markus
Krötzsch that focussed purely on RDF/SPARQL:

http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf

(I'm not sure if it was previously mentioned on the list.)

P.P.S., as someone who's somewhat of an outsider but who's been watching
on for a few years now, I'd like to congratulate the communi

[Wikidata] Wikidata query performance paper

2016-08-06 Thread Aidan Hogan

Hey all,

Recently we wrote a paper discussing the query performance for Wikidata, 
comparing different possible representations of the knowledge-base in 
Postgres (a relational database), Neo4J (a graph database), Virtuoso (a 
SPARQL database) and BlazeGraph (the SPARQL database currently in use) 
for a set of equivalent benchmark queries.


The paper was recently accepted for presentation at the International 
Semantic Web Conference (ISWC) 2016. A pre-print is available here:


http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf

Of course there are some caveats with these results in the sense that 
perhaps other engines would perform better on different hardware, or 
different styles of queries: for this reason we tried to use the most 
general types of queries possible and tried to test different 
representations in different engines (we did not vary the hardware). 
Also in the discussion of results, we tried to give a more general 
explanation of the trends, highlighting some strengths/weaknesses for 
each engine independently of the particular queries/data.


I think it's worth a glance for anyone who is interested in the 
technology/techniques needed to query Wikidata.


Cheers,
Aidan


P.S., the paper above is a follow-up to a previous work with Markus 
Krötzsch that focussed purely on RDF/SPARQL:


http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf

(I'm not sure if it was previously mentioned on the list.)

P.P.S., as someone who's somewhat of an outsider but who's been watching 
on for a few years now, I'd like to congratulate the community for 
making Wikidata what it is today. It's awesome work. Keep going. :)


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata