On Tue, Dec 17, 2019 at 7:16 PM Aidan Hogan <aid...@gmail.com> wrote:
>
> Hey all,
>
> As someone who likes to use Wikidata in their research, and likes to
> give students projects relating to Wikidata, I am finding it more and
> more difficult to (recommend to) work with recent versions of Wikidata
> due to the increasing dump sizes, where even the truthy version now
> costs considerable time and machine resources to process and handle. In
> some cases we just grin and bear the costs, while in other cases we
> apply an ad hoc sampling to be able to play around with the data and try
> things quickly.
>
> More generally, I think the growing data volumes might inadvertently
> scare people off taking the dumps and using them in their research.
>
> One idea we had recently to reduce the data size for a student project
> while keeping the most notable parts of Wikidata was to only keep claims
> that involve an item linked to Wikipedia; in other words, if the
> statement involves a Q item (in the "subject" or "object") not linked to
> Wikipedia, the statement is removed.
>
> I wonder would it be possible for Wikidata to provide such a dump to
> download (e.g., in RDF) for people who prefer to work with a more
> concise sub-graph that still maintains the most "notable" parts? While
> of course one could compute this from the full-dump locally, making such
> a version available as a dump directly would save clients some
> resources, potentially encourage more research using/on Wikidata, and
> having such a version "rubber-stamped" by Wikidata would also help to
> justify the use of such a dataset for research purposes.
>
> ... just an idea I thought I would float out there. Perhaps there is
> another (better) way to define a concise dump.
>
> Best,
> Aidan
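
For readers who want to try the proposed filter locally, here is a minimal Python sketch of the rule described above: drop any statement whose subject or object is a Q item without a Wikipedia sitelink. It is not an official tool. It assumes the triples have already been parsed into (subject, predicate, object) string tuples, and that sitelinks appear as <Wikipedia article URL> schema:about <item>, which is how I understand the RDF dumps encode them. The function names and the Q9000001/Q9000002 identifiers are made up for illustration.

```python
import re

ENTITY = "http://www.wikidata.org/entity/"
SCHEMA_ABOUT = "http://schema.org/about"
# Assumption: a Wikipedia sitelink subject looks like https://<lang>.wikipedia.org/wiki/<title>
WIKIPEDIA_URL = re.compile(r"^https?://[a-z\-]+\.wikipedia\.org/wiki/")
Q_ITEM = re.compile(r"^http://www\.wikidata\.org/entity/Q\d+$")


def linked_items(triples):
    """Pass 1: collect the Q items that are the target of a Wikipedia sitelink."""
    return {o for s, p, o in triples
            if p == SCHEMA_ABOUT and WIKIPEDIA_URL.match(s)}


def is_unlinked_item(term, linked):
    """True if the term is a Q item that has no Wikipedia sitelink."""
    return bool(Q_ITEM.match(term)) and term not in linked


def concise_subgraph(triples):
    """Pass 2: keep only triples whose subject and object are not unlinked Q items."""
    triples = list(triples)
    linked = linked_items(triples)
    return [t for t in triples
            if not is_unlinked_item(t[0], linked)
            and not is_unlinked_item(t[2], linked)]


# Toy data with hypothetical item IDs: one linked item, one unlinked item.
SAMPLE = [
    ("https://en.wikipedia.org/wiki/Example_Person", SCHEMA_ABOUT, ENTITY + "Q9000001"),
    # occupation (P106) pointing at an item with no sitelink in this sample
    (ENTITY + "Q9000001", "http://www.wikidata.org/prop/direct/P106", ENTITY + "Q9000002"),
    # date of birth (P569) with a literal value
    (ENTITY + "Q9000001", "http://www.wikidata.org/prop/direct/P569", '"1952-03-11"'),
]

if __name__ == "__main__":
    for triple in concise_subgraph(SAMPLE):
        print(triple)
```

On the toy data the sitelink triple and the date-of-birth triple are kept, while the occupation statement is dropped because its object has no sitelink. A real dump would need a streaming two-pass version rather than an in-memory list.
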

Hi Aidan,

That the dumps are becoming too big is an issue I've heard a number of times now. It's something we need to tackle. My biggest issue, though, is deciding how to slice and dice it in a way that works for many use cases. We have https://phabricator.wikimedia.org/T46581 to brainstorm about that and figure it out. Input from several people is very welcome. I also added a link to Benno's tool there.

As for the specific suggestion: I fear relying on the existence of sitelinks will kick out a lot of important things you would care about, like professions, so I'm not sure that's a good thing to offer officially for a larger audience.

Cheers
Lydia

--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata
Wikimedia Deutschland e.V.

_______________________________________________
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata