Re: [Wikidata-l] WikiData change propagation for third parties
On 26 April 2013 17:15, Daniel Kinzler daniel.kinz...@wikimedia.de wrote:
> On 26.04.2013 16:56, Denny Vrandečić wrote:
>> The third party propagation is not very high on our priority list. Not
>> because it is not important, but because there are things that are even more
>> important - like getting it to work for Wikipedia :) And this seems to be
>> stabilizing. What we have, for now:
>> * We have the broadcast of all edits through IRC.
>
> This interface is quite unreliable, the output can't be parsed in an
> unambiguous way, and may get truncated. I did implement notifications via
> XMPP several years ago, but it never went beyond a proof of concept. Have a
> look at the XMLRC extension if you are interested.
>
>> * One could poll recent changes, but with 200-450 edits per minute, this
>> might get problematic.
>
> Well, polling isn't really the problem, fetching all the content is. And
> you'd need to do that no matter how you get the information about what has
> changed.
>
>> * We do have the OAIRepository extension installed on Wikidata. Did anyone
>> try that?
>
> In principle that is a decent update interface, but I'd recommend not to use
> OAI before we have implemented feature 47714 (Support RDF and API
> serializations of entity data via OAI-PMH). Right now, what you'd get from
> there would be our *internal* JSON representation, which is different from
> what the API returns, and may change at any time without notice.

Somewhat off-topic: I didn't know you have different JSON representations. I'm
curious and I'd be happy about a few quick answers...

- How many are there? Just the two, internal and external?
- Which JSON representations do the API and the XML dump provide? Will they do
  so in the future?
- Are the API and XML dump representations stable? Or should we expect some
  changes?

JC

> -- daniel
> --
> Daniel Kinzler, Softwarearchitekt
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l
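The polling option Daniel mentions can be sketched against the standard MediaWiki API. A minimal sketch, assuming the usual `list=recentchanges` module on wikidata.org; the exact `rcprop` fields a consumer needs may differ:

```python
from urllib.parse import urlencode

# Hypothetical minimal poller for the standard MediaWiki
# list=recentchanges API module on wikidata.org.
API = "https://www.wikidata.org/w/api.php"

def recent_changes_url(limit=500, rccontinue=None):
    """Build a recentchanges query URL; rccontinue (returned by the
    API) is used to page through results between polls."""
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp",
        "rclimit": str(limit),
        "format": "json",
    }
    if rccontinue:
        params["rccontinue"] = rccontinue
    return API + "?" + urlencode(params)

def changed_pages(response):
    """Extract (title, revid) pairs from a decoded API response."""
    return [(rc["title"], rc["revid"])
            for rc in response["query"]["recentchanges"]]

# Usage against the live API (needs network):
#   import json; from urllib.request import urlopen
#   changes = changed_pages(json.load(urlopen(recent_changes_url(limit=10))))
```

As the thread notes, the poll itself is cheap; the expensive part is the follow-up fetch of each changed page's content.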
Re: [Wikidata-l] WikiData change propagation for third parties
On 26.04.2013 21:13, Sebastian Hellmann wrote:
> Hi Daniel,
> Am 26.04.2013 18:01, schrieb Daniel Kinzler:
>> You guys are the only reason the interface still exists :) DBpedia is the
>> only (regular) external user (LuceneSearch is the only internal user). Note
>> that there's nobody really maintaining this interface, so finding an
>> alternative would be great. Or deciding we (or more precisely, the Wikimedia
>> Foundation - there's not much the Wikidata team can do there) really want to
>> support OAI in the future, and then overhaul the implementation.
>> -- daniel
>
> Actually, we asked quite often about where to change to and what would be the
> best way for us to create a live mirror. We just never received an answer...

Yeah, that's the issue: the OAI interface is still up, but it's pretty much
unsupported. But there is no alternative as far as I know. It seems like
everyone wants PubSubHubbub for this, but as far as I know, nobody is working
on it.

I think it's fine for you to keep using OAI for now. Just be aware that once
the WMF moves search to Solr, you will be the *only* user of the OAI
interface... so keep in touch with the Foundation about it.

-- daniel
Re: [Wikidata-l] WikiData change propagation for third parties
On 04.05.2013 12:05, Jona Christopher Sahnwaldt wrote:
> On 26 April 2013 17:15, Daniel Kinzler daniel.kinz...@wikimedia.de wrote:
>> *internal* JSON representation, which is different from what the API
>> returns, and may change at any time without notice.
>
> Somewhat off-topic: I didn't know you have different JSON representations.
> I'm curious and I'd be happy about a few quick answers...
>
> - How many are there? Just the two, internal and external?

Yes, these two.

> - Which JSON representations do the API and the XML dump provide? Will they
>   do so in the future?

The XML dump provides the internal representation (since it's a dump of the
raw page content). The API uses the external representation. This is pretty
much dictated by the nature of the dumps and the API, so it will stay that
way. However, we plan to add more types of dumps, including:

* a plain JSON dump (using the external representation)
* an RDF/XML dump

It's not certain yet when or even if we'll provide these, but we are
considering it.

> - Are the API and XML dump representations stable? Or should we expect some
>   changes?

The internal representation is unstable and subject to change without notice.
In fact, it may even change to something other than JSON. I don't think it's
even documented anywhere outside the source code.

The external representation is pretty stable, though not final yet. We will
definitely make additions to it, and some (hopefully minor) structural changes
may be necessary. We'll try to stay largely backwards compatible, but can't
promise full stability yet. Also, the external representation uses the API
framework for generating the actual JSON, and may be subject to changes
imposed by that framework.

Unfortunately, this means that there are currently no dumps with a reliable
representation of our data. You need to a) use the API, or b) use the unstable
internal JSON, or c) wait for real data dumps.
-- daniel
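For anyone taking option (a), the external representation is what the regular API returns. A minimal sketch using the `wbgetentities` module; the nesting (`entities` → id → `labels` → language → `value`) reflects the external serialization as commonly documented, but treat the exact field names as assumptions to verify against a live response:

```python
from urllib.parse import urlencode

# Sketch of fetching an entity's *external* JSON via the wbgetentities
# API module on wikidata.org.
API = "https://www.wikidata.org/w/api.php"

def entity_url(entity_id):
    """Build a wbgetentities request URL for one entity id, e.g. 'Q42'."""
    params = {"action": "wbgetentities", "ids": entity_id, "format": "json"}
    return API + "?" + urlencode(params)

def get_labels(response, entity_id):
    """Return {language: label text} from a decoded wbgetentities
    response (assumed layout: entities -> id -> labels -> lang -> value)."""
    entity = response["entities"][entity_id]
    return {lang: v["value"] for lang, v in entity.get("labels", {}).items()}

# Usage against the live API (needs network):
#   import json; from urllib.request import urlopen
#   labels = get_labels(json.load(urlopen(entity_url("Q42"))), "Q42")
```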
Re: [Wikidata-l] WikiData change propagation for third parties
On 4 May 2013 17:12, Daniel Kinzler daniel.kinz...@wikimedia.de wrote:
> [...]
> Unfortunately, this means that there are currently no dumps with a reliable
> representation of our data. You need to a) use the API, or b) use the
> unstable internal JSON, or c) wait for real data dumps.

Thanks for the clarification.
Not the best news, but not terribly bad either. We will produce a DBpedia
release pretty soon; I don't think we can wait for the real dumps. The
inter-language links are an important part of DBpedia, so we have to extract
data from almost all Wikidata items. I don't think it's sensible to make ~10
million calls to the API to download the external JSON format, so we will have
to use the XML dumps and thus the internal format.

But I think it's not a big deal that the internal format is not that stable:
we parse the JSON into an AST anyway. It just means that we will have to use a
more abstract AST, which I was planning to do anyway. As long as the semantics
of the internal format remain more or less the same - it will contain the
labels, the language links, the properties, etc. - it's no big deal if the
syntax changes, even if it's not JSON anymore.

Christopher
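The "more abstract AST" idea amounts to normalizing either serialization into one shape before extraction. A sketch under a loud assumption: the internal format's key names used here ("label", "links") are hypothetical stand-ins, since that format is undocumented; only the external names ("labels", "sitelinks") follow the API output:

```python
# Normalize an entity from either JSON flavour into one abstract shape.
# The internal-format keys "label" and "links" are HYPOTHETICAL; the
# internal representation is explicitly unstable and undocumented.

def normalize(entity):
    """Return {'labels': {lang: text}, 'sitelinks': {site: page}}
    regardless of which serialization the entity came from."""
    raw_labels = entity.get("labels", entity.get("label", {}))
    # External labels are {lang: {"language": ..., "value": ...}};
    # flatten to {lang: text}. Plain strings pass through unchanged.
    labels = {lang: (v["value"] if isinstance(v, dict) else v)
              for lang, v in raw_labels.items()}
    sitelinks = dict(entity.get("sitelinks", entity.get("links", {})))
    return {"labels": labels, "sitelinks": sitelinks}
```

The point of the indirection is exactly what Christopher describes: if the syntax changes, only this normalization layer has to follow, not the extractors built on top of it.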
Re: [Wikidata-l] WikiData change propagation for third parties
On 04.05.2013 19:13, Jona Christopher Sahnwaldt wrote:
> We will produce a DBpedia release pretty soon, I don't think we can wait for
> the real dumps. The inter-language links are an important part of DBpedia,
> so we have to extract data from almost all Wikidata items. I don't think
> it's sensible to make ~10 million calls to the API to download the external
> JSON format, so we will have to use the XML dumps and thus the internal
> format.

Oh, if it's just the language links, this isn't an issue: there's an
additional table for them in the database, and we'll soon be providing a
separate dump of that table at http://dumps.wikimedia.org/wikidatawiki/

If it's not there when you need it, just ask us for a dump of the sitelinks
table (technically, wb_items_per_site), and we'll get you one.

> But I think it's not a big deal that it's not that stable: we parse the JSON
> into an AST anyway. [...]

Yes, if you want the labels and properties in addition to the links, you'll
have to do that for now. But I'm working on the real data dumps.

-- daniel
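Consuming such a sitelinks dump would mostly be a matter of grouping table rows by item id. A sketch, assuming the relevant wb_items_per_site columns are (ips_item_id, ips_site_id, ips_site_page); the column names come from the Wikibase schema as I understand it and should be verified against the actual dump:

```python
from collections import defaultdict

def sitelinks_by_item(rows):
    """Group (ips_item_id, ips_site_id, ips_site_page) rows into
    {item_id: {site_id: page_title}} for inter-language link extraction."""
    links = defaultdict(dict)
    for item_id, site_id, page in rows:
        links[item_id][site_id] = page
    return dict(links)

# Rows as they might come out of a dump of wb_items_per_site
# (values here are illustrative, not real dump content):
rows = [
    (42, "enwiki", "Douglas Adams"),
    (42, "dewiki", "Douglas Adams"),
]
```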
Re: [Wikidata-l] WikiData change propagation for third parties
Dear Jeremy,

please read the email from Daniel Kinzler on this list from 26.03.2013 18:26:

* A dispatcher needs about 3 seconds to dispatch 1000 changes to a client wiki.
* Considering we have ~300 client wikis, this means one dispatcher can handle
  about 4000 changes per hour.
* We currently have two dispatchers running in parallel (on a single box,
  hume), that makes a capacity of 8000 changes/hour.
* We are seeing roughly 17000 changes per hour on wikidata.org - more than
  twice our dispatch capacity.
* I want to try running 6 dispatcher processes; that would give us the
  capacity to handle 24000 changes per hour (assuming linear scaling).

1. Somebody needs to run the Hub and it needs to scale. It looks like the
protocol was intended to save some traffic, not to dispatch a massive number
of messages per day to a large number of clients. Again, I am not familiar
with how efficient PubSubHubbub is. What kind of hardware is needed to run
this effectively? Do you have experience with this?

2. Somebody will still need to run and maintain the Hub and feed all clients.
I was offering to host one of the hubs for DBpedia users, but I am not sure
whether we have that capacity.

So we should use the IRC RC feed plus an HTTP request to the changed page as a
fallback?

Sebastian

Am 26.04.2013 08:06, schrieb Jeremy Baron:
> Hi,
>
> On Fri, Apr 26, 2013 at 5:29 AM, Sebastian Hellmann
> hellm...@informatik.uni-leipzig.de wrote:
>> Well, PubSubHubbub is a nice idea. However, it clearly depends on two
>> factors:
>> 1. whether Wikidata sets up such an infrastructure (I need to check whether
>> we have capacities, I am not sure atm)
>
> Capacity for what? The infrastructure should not be a problem. (Famous last
> words; I can look more closely tomorrow, but I'm really not worried about
> it.) And you don't need any infrastructure at all for development; just use
> one of Google's public instances.
>
>> 2. whether performance is good enough to handle high-volume publishers
>
> Again, how do you mean?
>> Basically, polling recent changes [1] and then doing an HTTP request to the
>> individual pages should be fine for a start. So I guess this is what we
>> will implement, if there aren't any better suggestions. The whole issue is
>> problematic, and the DBpedia project would be happy if this were discussed
>> and decided right now, so we can plan development. What is the best
>> practice to get updates from Wikipedia at the moment?
>
> I believe just about everyone uses the IRC feed from irc.wikimedia.org.
> https://meta.wikimedia.org/wiki/IRC/Channels#Raw_feeds
>
> I imagine wikidata will or maybe already does propagate changes to a channel
> on that server, but I can imagine IRC would not be a good method for many
> instant data repo users. Some will not be able to sustain a single TCP
> connection for extended periods, some will not be able to use IRC ports at
> all, and some may go offline periodically, e.g. a server on a laptop.
>
> AIUI, PubSubHubbub has none of those problems and is better than the current
> IRC solution in just about every way. We could potentially even replace the
> current cross-DB job queue insert craziness with PubSubHubbub for use on the
> cluster internally.
>
> -Jeremy

--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://linguistics.okfn.org ,
http://dbpedia.org/Wiktionary , http://dbpedia.org
Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
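On the "quite unreliable, can't be parsed unambiguously" point about the raw IRC feed: the feed wraps each change in mIRC colour codes, so a consumer first strips those and then pattern-matches. A best-effort sketch; the sample line below is fabricated for illustration, and the real message layout varies:

```python
import re

# mIRC formatting: \x03 optionally followed by fg(,bg) colour digits,
# plus bold/reset/italic/underline control bytes.
COLOR = re.compile(r"\x03(?:\d{1,2}(?:,\d{1,2})?)?|[\x02\x0f\x16\x1d\x1f]")

def strip_formatting(line):
    """Remove mIRC colour/formatting codes from a raw feed line."""
    return COLOR.sub("", line)

def page_title(line):
    """Extract the [[Page title]] from an RC feed line, or None.
    This is exactly the kind of heuristic parsing the thread warns
    about: ambiguous, and broken by truncated messages."""
    m = re.search(r"\[\[(.*?)\]\]", strip_formatting(line))
    return m.group(1) if m else None

# Fabricated example resembling a raw irc.wikimedia.org RC message:
sample = "\x0314[[\x0307Q42\x0314]]\x034 \x0310 \x0302http://example/diff\x03"
```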
Re: [Wikidata-l] WikiData change propagation for third parties
Hi Daniel,

On Fri, Apr 26, 2013 at 6:15 PM, Daniel Kinzler
daniel.kinz...@wikimedia.de wrote:
> [...]
>> * We do have the OAIRepository extension installed on Wikidata. Did anyone
>> try that?
>
> In principle that is a decent update interface, but I'd recommend not to use
> OAI before we have implemented feature 47714 (Support RDF and API
> serializations of entity data via OAI-PMH). Right now, what you'd get from
> there would be our *internal* JSON representation, which is different from
> what the API returns, and may change at any time without notice.

What we do right now in DBpedia Live is that we have a local clone of
Wikipedia that gets in sync using the OAIRepository extension. This is done so
we can abuse our local copy as we please. The local copy also publishes
updates with OAI-PMH, which we use to get the list of modified page ids. Once
we get the page ids, we use the normal MediaWiki API to fetch the actual page
content.
So, feature 47714 should not be a problem in our case, since we don't need the
data serialized directly from OAI-PMH.

Cheers,
Dimitris

--
Kontokostas Dimitris
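The "get the list of modified page ids" step Dimitris describes can be sketched by parsing an OAI-PMH ListIdentifiers response. The XML shape (header/identifier/datestamp under the OAI 2.0 namespace) follows the OAI-PMH specification; the identifier string in the sample is a made-up placeholder, not the actual format the extension emits:

```python
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def modified_identifiers(xml_text):
    """Return (identifier, datestamp) pairs from an OAI-PMH
    ListIdentifiers response; each identifier maps to a changed page."""
    root = ET.fromstring(xml_text)
    out = []
    for header in root.iter(OAI_NS + "header"):
        ident = header.find(OAI_NS + "identifier").text
        stamp = header.find(OAI_NS + "datestamp").text
        out.append((ident, stamp))
    return out

# Illustrative response; the identifier scheme is hypothetical.
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header>
      <identifier>oai:wikidata.org:wikidatawiki:123</identifier>
      <datestamp>2013-04-26T17:15:00Z</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""
```

The ids recovered this way would then feed the ordinary page-content fetch via the MediaWiki API, as in the DBpedia Live setup.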
Re: [Wikidata-l] WikiData change propagation for third parties
On 26.04.2013 17:31, Dimitris Kontokostas wrote:
> What we do right now in DBpedia Live is that we have a local clone of
> Wikipedia that gets in sync using the OAIRepository extension. This is done
> so we can abuse our local copy as we please.

It would be awesome if this Just Worked (tm) for Wikidata too, but I highly
doubt it. You can use the OAI interface to get (unstable) data from Wikidata,
but I don't think magic import from OAI will work. Generally, importing
Wikidata entities into another wiki is problematic, because of entity IDs and
uniqueness constraints. If the target wiki is perfectly in sync, it might
work... Are you going to try this? Would be great if you could give us
feedback!

-- daniel
Re: [Wikidata-l] WikiData change propagation for third parties
Hello Dimitris,

what do you think of that? Shall I write this part as an abstract part in the
proposal and wait for more details, or could we have a similar plan like the
one already implemented in DBpedia?
http://wiki.dbpedia.org/DBpediaLive#h156-3

thanks
regards

On Fri, Apr 26, 2013 at 12:50 AM, Jeremy Baron jer...@tuxmachine.com wrote:
> On Thu, Apr 25, 2013 at 10:42 PM, Hady elsahar hadyelsa...@gmail.com wrote:
>> 2- is there any design pattern or a brief outline for the change
>> propagation design, how would it be? In order that I could make a rough
>> plan and estimation about how it could be consumed from the DBpedia side?
>
> I don't know anything about the plan for this but it seems at first glance
> like a good place to use [[w:PubSubHubbub]].
>
> -Jeremy

--
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University http://nileuniversity.edu.eg/
email: hadyelsa...@gmail.com
Phone: +2-01220887311
http://hadyelsahar.me/
http://www.linkedin.com/in/hadyelsahar
Re: [Wikidata-l] WikiData change propagation for third parties
Well, PubSubHubbub is a nice idea. However, it clearly depends on two factors:

1. whether Wikidata sets up such an infrastructure (I need to check whether we
have capacities, I am not sure atm)
2. whether performance is good enough to handle high-volume publishers

Basically, polling recent changes [1] and then doing an HTTP request to the
individual pages should be fine for a start. So I guess this is what we will
implement, if there aren't any better suggestions. The whole issue is
problematic, and the DBpedia project would be happy if this were discussed and
decided right now, so we can plan development.

What is the best practice to get updates from Wikipedia at the moment? We are
still using OAI-PMH...

In DBpedia, we use a simple self-created protocol:
http://wiki.dbpedia.org/DBpediaLive#h156-4

Publication of changesets: upon modification, old triples are replaced with
updated triples. The added and/or deleted triples are also written as
N-Triples files and then compressed. Any client application or DBpedia-Live
mirror can download those files to integrate them and, hence, update a local
copy of DBpedia. This enables that application to always be in sync with
DBpedia-Live.

This could also work for Wikidata facts, right?

Other useful links:
- http://www.openarchives.org/rs/0.5/resourcesync
- http://www.sdshare.org/
- http://www.w3.org/community/sdshare/
- http://www.rabbitmq.com/

All the best,
Sebastian

[1] https://www.wikidata.org/w/index.php?title=Special:RecentChanges&feed=atom

Am 26.04.2013 03:15, schrieb Hady elsahar:
> Hello Dimitris,
> what do you think of that?
> shall i write this part as an abstract part in the proposal and wait for
> more details, or could we have a similar plan like the one already
> implemented in dbpedia http://wiki.dbpedia.org/DBpediaLive#h156-3
> [...]
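Sebastian's changeset protocol boils down to set arithmetic over triples. A minimal sketch of a mirror applying one changeset, with triples modelled as plain tuples and made-up example data; real DBpedia-Live changesets are compressed N-Triples files, not Python literals:

```python
def apply_changeset(store, added, removed):
    """Apply one DBpedia-Live-style changeset to a local triple store:
    drop the deleted triples, then insert the added ones.
    Returns the updated store as a new set."""
    store = set(store)
    store.difference_update(removed)
    store.update(added)
    return store

# Illustrative local mirror state and one changeset:
store = {("dbr:Berlin", "rdfs:label", '"Berlin"@en'),
         ("dbr:Berlin", "dbo:populationTotal", '"3400000"')}
added = {("dbr:Berlin", "dbo:populationTotal", '"3500000"')}
removed = {("dbr:Berlin", "dbo:populationTotal", '"3400000"')}
store = apply_changeset(store, added, removed)
```

A mirror that downloads and replays changesets in publication order converges to the publisher's state, which is why the same scheme could plausibly carry Wikidata facts as well.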