Re: [Wikidata-l] WikiData change propagation for third parties

2013-05-04 Thread Jona Christopher Sahnwaldt
On 26 April 2013 17:15, Daniel Kinzler daniel.kinz...@wikimedia.de wrote:
 On 26.04.2013 16:56, Denny Vrandečić wrote:
 The third party propagation is not very high on our priority list. Not 
 because
 it is not important, but because there are things that are even more 
 important -
 like getting it to work for Wikipedia :) And this seems to be stabilizing.

 What we have, for now:

 * We have the broadcast of all edits through IRC.

 This interface is quite unreliable, the output can't be parsed in an 
 unambiguous
 way, and may get truncated. I did implement notifications via XMPP several 
 years
 ago, but it never went beyond a proof of concept. Have a look at the XMLRC
 extension if you are interested.

 * One could poll recent changes, but with 200-450 edits per minute, this 
 might
 get problematic.

 Well, polling isn't really the problem, fetching all the content is. And you'd
 need to do that no matter how you get the information of what has changed.

 * We do have the OAIRepository extension installed on Wikidata. Did anyone 
 try that?

 In principle that is a decent update interface, but I'd recommend not to use 
 OAI
  before we have implemented feature 47714 (Support RDF and API serializations
 of entity data via OAI-PMH). Right now, what you'd get from there would be 
 our
 *internal* JSON representation, which is different from what the API returns,
 and may change at any time without notice.

Somewhat off-topic: I didn't know you have different JSON
representations. I'm curious and I'd be happy about a few quick
answers...

- How many are there? Just the two, internal and external?
- Which JSON representations do the API and the XML dump provide? Will
they do so in the future?
- Are the API and XML dump representations stable? Or should we expect
some changes?

JC


 -- daniel

 --
 Daniel Kinzler, Softwarearchitekt
 Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.


 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] WikiData change propagation for third parties

2013-05-04 Thread Daniel Kinzler
On 26.04.2013 21:13, Sebastian Hellmann wrote:
 Hi Daniel,
 
 On 26.04.2013 18:01, Daniel Kinzler wrote:
 You guys are the only reason the interface still exists :) DBpedia is the 
 only
 (regular) external user (LuceneSearch is the only internal user). Note that
 there's nobody really maintaining this interface, so finding an alternative
 would be great. Or deciding we (or more precisely, the Wikimedia Foundation -
 there's not much the Wikidata team can do there) really want to support OAI 
 in
 the future, and then overhaul the implementation. -- daniel 
 
 Actually, we asked quite often about where to change to and what would be the
 best way for us to create a live mirror. We just never received an answer...

Yeah, that's the issue: the OAI interface is still up, but it's pretty much
unsupported. But as far as I know there is no alternative. It seems like
everyone wants PubSubHubbub for this, but nobody is working on it.

I think it's fine for you to keep using OAI for now. Just be aware that once the
WMF moves search to Solr, you will be the *only* user of the OAI interface...
so keep in touch with the Foundation about it.

-- daniel


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] WikiData change propagation for third parties

2013-05-04 Thread Daniel Kinzler
On 04.05.2013 12:05, Jona Christopher Sahnwaldt wrote:
 On 26 April 2013 17:15, Daniel Kinzler daniel.kinz...@wikimedia.de wrote:
 *internal* JSON representation, which is different from what the API returns,
 and may change at any time without notice.
 
 Somewhat off-topic: I didn't know you have different JSON
 representations. I'm curious and I'd be happy about a few quick
 answers...
 
 - How many are there? Just the two, internal and external?

Yes, these two.

 - Which JSON representations do the API and the XML dump provide? Will
 they do so in the future?

The XML dump provides the internal representation (since it's a dump of the raw
page content). The API uses the external representation.

This is pretty much dictated by the nature of the dumps and the API, so it will
stay that way. However, we plan to add more types of dumps, including:

* a plain JSON dump (using the external representation)
* an RDF/XML dump

It's not certain yet when, or even whether, we'll provide these, but we are
considering it.

 - Are the API and XML dump representations stable? Or should we expect
 some changes?

The internal representation is unstable and subject to changes without notice.
In fact, it may even change to something other than JSON. I don't think it's
even documented anywhere outside the source code.

The external representation is pretty stable, though not final yet. We will
definitely make additions to it, and some (hopefully minor) structural changes
may be necessary. We'll try to stay largely backwards compatible, but can't
promise full stability yet.

Also, the external representation uses the API framework for generating the
actual JSON, and may be subject to changes imposed by that framework.


Unfortunately, this means that there are currently no dumps with a reliable
representation of our data. You need to a) use the API or b) use the unstable
internal JSON or c) wait for real data dumps.
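
For what it's worth, option a) might look roughly like this (a minimal Python
sketch using the Wikibase wbgetentities module and the requests library; treat
the parameter and key names as assumptions rather than a stable contract):

import requests

API = "https://www.wikidata.org/w/api.php"

def get_entity(entity_id):
    # Fetch one entity in the *external* JSON representation via the API.
    params = {
        "action": "wbgetentities",
        "ids": entity_id,        # e.g. "Q42"
        "format": "json",
    }
    data = requests.get(API, params=params).json()
    return data["entities"][entity_id]

item = get_entity("Q42")
print(item.get("labels", {}).get("en"))        # label record for English
print(item.get("sitelinks", {}).get("enwiki")) # sitelink record for the English Wikipedia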

-- daniel

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] WikiData change propagation for third parties

2013-05-04 Thread Jona Christopher Sahnwaldt
On 4 May 2013 17:12, Daniel Kinzler daniel.kinz...@wikimedia.de wrote:
 On 04.05.2013 12:05, Jona Christopher Sahnwaldt wrote:
 On 26 April 2013 17:15, Daniel Kinzler daniel.kinz...@wikimedia.de wrote:
 *internal* JSON representation, which is different from what the API 
 returns,
 and may change at any time without notice.

 Somewhat off-topic: I didn't know you have different JSON
 representations. I'm curious and I'd be happy about a few quick
 answers...

 - How many are there? Just the two, internal and external?

 Yes, these two.

 - Which JSON representations do the API and the XML dump provide? Will
 they do so in the future?

 The XML dump provides the internal representation (since it's a dump of the 
 raw
 page content). The API uses the external representation.

 This is pretty much dictated by the nature of the dumps and the API, so it 
 will
 stay that way. However, we plan to add more types of dumps, including:

 * a plain JSON dump (using the external representation)
 * an RDF/XML dump

 It's not certain yet when, or even whether, we'll provide these, but we are
 considering it.

 - Are the API and XML dump representations stable? Or should we expect
 some changes?

 The internal representation is unstable and subject to changes without notice.
 In fact, it may even change to something other than JSON. I don't think it's
 even documented anywhere outside the source code.

 The external representation is pretty stable, though not final yet. We will
 definitely make additions to it, and some (hopefully minor) structural changes
 may be necessary. We'll try to stay largely backwards compatible, but can't
 promise full stability yet.

 Also, the external representation uses the API framework for generating the
 actual JSON, and may be subject to changes imposed by that framework.


 Unfortunately, this means that there are currently no dumps with a reliable
 representation of our data. You need to a) use the API or b) use the unstable
 internal JSON or c) wait for real data dumps.

Thanks for the clarification. Not the best news, but not terribly bad either.

We will produce a DBpedia release pretty soon; I don't think we can
wait for the real dumps. The inter-language links are an important
part of DBpedia, so we have to extract data from almost all Wikidata
items. I don't think it's sensible to make ~10 million calls to the
API to download the external JSON format, so we will have to use the
XML dumps and thus the internal format. But I think it's not a big
deal that it's not that stable: we parse the JSON into an AST anyway.
It just means that we will have to use a more abstract AST, which I
was planning to do anyway. As long as the semantics of the internal
format remain more or less the same - it will contain the labels,
the language links, the properties, etc. - it's no big deal if the
syntax changes, even if it's not JSON anymore.
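
For illustration, a minimal sketch of that dump-based approach (Python,
streaming the pages-articles XML dump and parsing each page text as JSON; the
key names "links" and "label" are assumptions about the internal format, which,
as Daniel says, is undocumented and may change):

import bz2
import json
import xml.etree.ElementTree as ET

NS = "{http://www.mediawiki.org/xml/export-0.8/}"   # export schema version may differ

def iter_items(dump_path):
    # Yield (title, parsed JSON) for every item page in the pages-articles dump.
    with bz2.open(dump_path, "rb") as dump:
        for _, elem in ET.iterparse(dump):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title") or ""
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                if title.startswith("Q"):    # item pages live in the main namespace
                    try:
                        yield title, json.loads(text)
                    except ValueError:
                        pass                 # skip non-JSON page content
                elem.clear()                 # keep memory bounded while streaming

for title, data in iter_items("wikidatawiki-pages-articles.xml.bz2"):
    sitelinks = data.get("links", {})    # assumed key name in the internal format
    labels = data.get("label", {})       # assumed key name in the internal format
    # ... feed these into the (more abstract) extraction AST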

Christopher


 -- daniel

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] WikiData change propagation for third parties

2013-05-04 Thread Daniel Kinzler
On 04.05.2013 19:13, Jona Christopher Sahnwaldt wrote:
 We will produce a DBpedia release pretty soon; I don't think we can
 wait for the real dumps. The inter-language links are an important
 part of DBpedia, so we have to extract data from almost all Wikidata
 items. I don't think it's sensible to make ~10 million calls to the
 API to download the external JSON format, so we will have to use the
 XML dumps and thus the internal format.

Oh, if it's just the language links, this isn't an issue: there's an additional
table for them in the database, and we'll soon be providing a separate dump of
that table at http://dumps.wikimedia.org/wikidatawiki/

If it's not there when you need it, just ask us for a dump of the sitelinks
table (technically, wb_items_per_site), and we'll get you one.
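
For reference, once such a dump is loaded into a local database, pulling the
language links back out could look roughly like this (a Python sketch using the
pymysql driver; the ips_* column names are an assumption about the
wb_items_per_site schema, not a documented contract):

import pymysql

conn = pymysql.connect(host="localhost", user="wikidata", password="secret",
                       database="wikidatawiki")

# Assumed columns: ips_item_id (numeric item id), ips_site_id (e.g. "enwiki"),
# ips_site_page (the linked page title on that site).
sql = """
    SELECT ips_item_id, ips_site_id, ips_site_page
    FROM wb_items_per_site
    WHERE ips_site_id IN ('enwiki', 'dewiki', 'frwiki')
"""

with conn.cursor() as cursor:
    cursor.execute(sql)
    for item_id, site, page in cursor:
        print("Q%d" % item_id, site, page)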

 But I think it's not a big
 deal that it's not that stable: we parse the JSON into an AST anyway.
 It just means that we will have to use a more abstract AST, which I
 was planning to do anyway. As long as the semantics of the internal
 format remain more or less the same - it will contain the labels,
 the language links, the properties, etc. - it's no big deal if the
 syntax changes, even if it's not JSON anymore.

Yes, if you want the labels and properties in addition to the links, you'll have
to do that for now. But I'm working on the real data dumps.

-- daniel


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] WikiData change propagation for third parties

2013-04-26 Thread Sebastian Hellmann

Dear Jeremy,
please read the email from Daniel Kinzler on this list from 26.03.2013, 18:26
(his figures are quoted below, with the arithmetic spelled out right after them):


* A dispatcher needs about 3 seconds to dispatch 1000 changes to a client wiki.
* Considering we have ~300 client wikis, this means one dispatcher can handle
about 4000 changes per hour.
* We currently have two dispatchers running in parallel (on a single box, hume),
that makes a capacity of 8000 changes/hour.
* We are seeing roughly 17000 changes per hour on wikidata.org - more than twice
our dispatch capacity.
* I want to try running 6 dispatcher processes; that would give us the capacity
to handle 24000 changes per hour (assuming linear scaling).
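
Spelling out the arithmetic behind those figures (a quick back-of-the-envelope
check in Python, assuming dispatch cost scales linearly with the number of
client wikis and dispatcher processes):

seconds_per_1000_changes_per_wiki = 3
client_wikis = 300

# One dispatcher has to push each batch of 1000 changes to every client wiki.
seconds_per_1000_changes = seconds_per_1000_changes_per_wiki * client_wikis  # 900 s
changes_per_hour_per_dispatcher = 1000 * 3600 / seconds_per_1000_changes     # 4000

observed_changes_per_hour = 17000

for dispatchers in (2, 6):
    capacity = dispatchers * changes_per_hour_per_dispatcher   # 8000 resp. 24000
    enough = capacity >= observed_changes_per_hour
    print(dispatchers, "dispatchers:", capacity, "changes/hour,",
          "enough" if enough else "not enough")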


1. Somebody needs to run the Hub and it needs to scale. It looks like the
protocol was intended to save some traffic, not to dispatch a massive
number of messages per day to a large number of clients. Again, I am
not familiar with how efficient PubSubHubbub is. What kind of hardware is
needed to run this effectively? Do you have experience with this? (A
minimal subscriber sketch follows after point 2 below.)


2. Somebody will still need to run and maintain the Hub and feed all
clients. I was offering to host one of the hubs for DBpedia users, but I
am not sure whether we have that capacity.
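
To make points 1 and 2 a bit more concrete: the subscriber side of PubSubHubbub
is just an HTTP handshake plus a callback endpoint, sketched below in Python
(the hub.* form parameters follow the published PubSubHubbub spec; the hub,
topic and callback URLs are placeholders, and pubsubhubbub.appspot.com is named
only as an example of a public hub):

import requests

HUB = "https://pubsubhubbub.appspot.com/"        # e.g. one of Google's public hubs
TOPIC = "https://www.wikidata.org/updates.atom"  # hypothetical feed to subscribe to
CALLBACK = "https://example.org/push-callback"   # our publicly reachable endpoint

# Ask the hub to start pushing updates for TOPIC to CALLBACK.
resp = requests.post(HUB, data={
    "hub.mode": "subscribe",
    "hub.topic": TOPIC,
    "hub.callback": CALLBACK,
    "hub.verify": "async",   # the hub confirms by GETting the callback with a challenge
})
print(resp.status_code)      # 202 means the hub accepted and will verify asynchronously

# The callback endpoint must (a) echo hub.challenge on the verification GET and
# (b) accept the POSTed feed updates afterwards. The hub fans those POSTs out to
# every subscriber, which is exactly the part that somebody has to host and scale.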


So should we use the IRC RC feed plus an HTTP request to the changed page as a
fallback?

Sebastian

On 26.04.2013 08:06, Jeremy Baron wrote:

  Hi,

On Fri, Apr 26, 2013 at 5:29 AM, Sebastian Hellmann
hellm...@informatik.uni-leipzig.de wrote:

Well, PubSubHubbub is a nice idea. However it clearly depends on two factors:
1. whether Wikidata sets up such an infrastructure (I need to check whether we 
have capacities, I am not sure atm)

Capacity for what? The infrastructure should not be a problem.
(Famous last words; I can look more closely tomorrow, but I'm really not
worried about it.) And you don't need any infrastructure at all for
development; just use one of Google's public instances.


2. whether performance is good enough to handle high-volume publishers

Again, how do you mean?


Basically, polling recent changes [1] and then doing an HTTP request to the
individual pages should be fine for a start. So I guess this is what we will
implement, if there aren't any better suggestions.
The whole issue is problematic, and the DBpedia project would be happy if this
were discussed and decided right now, so we can plan development.

What is the best practice to get updates from Wikipedia at the moment?

I believe just about everyone uses the IRC feed from
irc.wikimedia.org.
https://meta.wikimedia.org/wiki/IRC/Channels#Raw_feeds

I imagine Wikidata will, or maybe already does, propagate changes to a
channel on that server, but I can imagine IRC would not be a good
method for many Instant data repo users. Some will not be able to
sustain a single TCP connection for extended periods, some will not be
able to use IRC ports at all, and some may go offline periodically,
e.g. a server on a laptop. AIUI, PubSubHubbub has none of those
problems and is better than the current IRC solution in just about
every way.
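
To make the IRC option concrete, here is a bare-bones reader for that feed
(Python, raw sockets, no IRC library; the channel name #wikidata.wikipedia is
an assumption based on the usual irc.wikimedia.org naming convention, and the
colour-code stripping is approximate):

import re
import socket

SERVER = "irc.wikimedia.org"
CHANNEL = "#wikidata.wikipedia"   # assumed channel name for wikidata.org edits

sock = socket.create_connection((SERVER, 6667))
sock.sendall(b"NICK rc-reader-12345\r\n")
sock.sendall(b"USER rc-reader 0 * :recent changes reader\r\n")
sock.sendall(("JOIN %s\r\n" % CHANNEL).encode())

buffer = b""
while True:
    data = sock.recv(4096)
    if not data:
        break                     # connection closed; a real client would reconnect
    buffer += data
    *lines, buffer = buffer.split(b"\r\n")
    for raw in lines:
        line = raw.decode("utf-8", errors="replace")
        if line.startswith("PING"):
            sock.sendall(("PONG" + line[4:] + "\r\n").encode())
        elif "PRIVMSG" in line:
            # strip mIRC colour/formatting codes before parsing the RC line
            text = re.sub(r"\x03\d{0,2}(,\d{1,2})?|\x02|\x0f", "", line)
            print(text.split("PRIVMSG", 1)[1])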

We could potentially even replace the current cross-DB job queue
insert craziness with PubSubHubbub for use on the cluster internally.

-Jeremy

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l




--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , 
http://dbpedia.org/Wiktionary , http://dbpedia.org

Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] WikiData change propagation for third parties

2013-04-26 Thread Dimitris Kontokostas
Hi Daniel,

On Fri, Apr 26, 2013 at 6:15 PM, Daniel Kinzler daniel.kinz...@wikimedia.de
 wrote:

 On 26.04.2013 16:56, Denny Vrandečić wrote:
  The third party propagation is not very high on our priority list. Not
 because
  it is not important, but because there are things that are even more
 important -
  like getting it to work for Wikipedia :) And this seems to be
 stabilizing.
 
  What we have, for now:
 
  * We have the broadcast of all edits through IRC.

 This interface is quite unreliable, the output can't be parsed in an
 unambiguous
 way, and may get truncated. I did implement notifications via XMPP several
 years
 ago, but it never went beyond a proof of concept. Have a look at the XMLRC
 extension if you are interested.

  * One could poll recent changes, but with 200-450 edits per minute, this
 might
  get problematic.

 Well, polling isn't really the problem, fetching all the content is. And
 you'd
 need to do that no matter how you get the information of what has changed.

  * We do have the OAIRepository extension installed on Wikidata. Did
 anyone try that?

 In principle that is a decent update interface, but I'd recommend not to
 use OAI
  before we have implemented feature 47714 (Support RDF and API
 serializations
 of entity data via OAI-PMH). Right now, what you'd get from there would
 be our
 *internal* JSON representation, which is different from what the API
 returns,
 and may change at any time without notice.


What we do right now in DBpedia Live is that we have a local clone of
Wikipedia that gets in sync using the OAIRepository extension. This is
done to abuse our local copy as we please.

The local copy also publishes updates with OAI-PMH, which we use to get the
list of modified page ids. Once we have the page ids, we use the normal
MediaWiki API to fetch the actual page content.
So, feature 47714 should not be a problem in our case, since we don't need
the data serialized directly from OAI-PMH.
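
Roughly, that two-step pipeline looks like this (a Python sketch; the
Special:OAIRepository endpoint path and the identifier layout are assumptions
about the OAIRepository extension, while ListIdentifiers and the from parameter
are standard OAI-PMH):

import requests
import xml.etree.ElementTree as ET

OAI = "http://localhost/w/index.php?title=Special:OAIRepository"   # assumed endpoint
API = "http://localhost/w/api.php"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def modified_page_ids(since):
    # Step 1: ask OAI-PMH which pages changed since the given timestamp.
    resp = requests.get(OAI, params={
        "verb": "ListIdentifiers",
        "metadataPrefix": "oai_dc",
        "from": since,                        # e.g. "2013-04-26T00:00:00Z"
    })
    root = ET.fromstring(resp.content)
    for header in root.iter(OAI_NS + "header"):
        identifier = header.findtext(OAI_NS + "identifier") or ""
        yield identifier.rsplit(":", 1)[-1]   # assumed: page id is the last segment

def fetch_content(page_ids):
    # Step 2: fetch the actual page content through the normal MediaWiki API.
    return requests.get(API, params={
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "pageids": "|".join(page_ids),
        "format": "json",
    }).json()

changed = list(modified_page_ids("2013-04-26T00:00:00Z"))
for i in range(0, len(changed), 50):          # the API caps the batch size
    print(fetch_content(changed[i:i + 50]))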

Cheers,
Dimitris



 -- daniel

 --
 Daniel Kinzler, Softwarearchitekt
 Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.


 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l




-- 
Kontokostas Dimitris
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] WikiData change propagation for third parties

2013-04-26 Thread Daniel Kinzler
On 26.04.2013 17:31, Dimitris Kontokostas wrote:
 What we do right now in DBpedia Live is that we have a local clone of 
 Wikipedia
 that gets in sync using the OAIRepository extension. This is done to abuse 
 our
 local copy as we please.

It would be awesome if this Just Worked (tm) for Wikidata too, but I highly
doubt it. You can use the OAI interface to get (unstable) data from Wikidata,
but I don't think a magic import from OAI will work. Generally, importing Wikidata
entities into another wiki is problematic because of entity IDs and uniqueness
constraints. If the target wiki is perfectly in sync, it might work...

Are you going to try this? Would be great if you could give us feedback!

-- daniel
-- 
Daniel Kinzler, Softwarearchitekt
Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.


___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] WikiData change propagation for third parties

2013-04-25 Thread Hady elsahar
Hello Dimitris,

What do you think of that?
Shall I write this part as an abstract part in the proposal and wait for
more details, or could we have a similar plan to the one already implemented
in DBpedia? http://wiki.dbpedia.org/DBpediaLive#h156-3

thanks
regards


On Fri, Apr 26, 2013 at 12:50 AM, Jeremy Baron jer...@tuxmachine.com wrote:

 On Thu, Apr 25, 2013 at 10:42 PM, Hady elsahar hadyelsa...@gmail.com
 wrote:
  2- Is there any design pattern or a brief outline for the change
 propagation design, and how would it look? So that I could make a rough
 plan and estimate of how it could be consumed from the DBpedia side?

 I don't know anything about the plan for this but it seems at first
 glance like a good place to use [[w:PubSubHubbub]].

 -Jeremy

 ___
 Wikidata-l mailing list
 Wikidata-l@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata-l




-- 
-
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University  http://nileuniversity.edu.eg/

email : hadyelsa...@gmail.com
Phone : +2-01220887311
http://hadyelsahar.me/

http://www.linkedin.com/in/hadyelsahar
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l


Re: [Wikidata-l] WikiData change propagation for third parties

2013-04-25 Thread Sebastian Hellmann
Well, PubSubHubbub is a nice idea. However it clearly depends on two 
factors:
1. whether Wikidata sets up such an infrastructure (I need to check 
whether we have capacities, I am not sure atm)

2. whether performance is good enough to handle high-volume publishers

Basically, polling recent changes [1] and then doing an HTTP request to
the individual pages should be fine for a start. So I guess this is what
we will implement, if there aren't any better suggestions.
The whole issue is problematic, and the DBpedia project would be happy
if this were discussed and decided right now, so we can plan development.
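
As a rough illustration, such a polling loop could look like this in Python
(assuming the standard MediaWiki API on wikidata.org and the requests library;
the recentchanges parameters below are the usual ones, but treat the details
as a sketch, not a recommendation):

import time
import requests

API = "https://www.wikidata.org/w/api.php"

def poll_recent_changes(since):
    # Ask the MediaWiki API for edits newer than `since` (an ISO 8601 timestamp).
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcprop": "title|ids|timestamp",
        "rcdir": "newer",       # oldest first, so the cursor below can advance
        "rcstart": since,
        "rclimit": 500,
        "format": "json",
    }
    return requests.get(API, params=params).json()["query"]["recentchanges"]

def fetch_page(title):
    # Fetch the current content of one changed page.
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    }
    return requests.get(API, params=params).json()

last_seen = "2013-04-26T00:00:00Z"
while True:
    for change in poll_recent_changes(last_seen):
        page = fetch_page(change["title"])   # hand the content to the extraction code
        last_seen = change["timestamp"]
    time.sleep(60)                           # poll once a minute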


What is the best practice to get updates from Wikipedia at the moment?
We are still using OAI-PMH...

In DBpedia, we use a simple self-created protocol:
http://wiki.dbpedia.org/DBpediaLive#h156-4
/Publication of changesets/: Upon modification, old triples
are replaced with updated triples. Those added and/or deleted triples
are also written as N-Triples files and then compressed. Any client
application or DBpedia-Live mirror can download those files
and integrate them and, hence, update a local copy of DBpedia. This enables
that application to always be in synchronization with our DBpedia-Live.

This could also work for Wikidata facts, right?
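
On the client side, consuming one such changeset pair might look like this (a
Python sketch posting SPARQL Update to a local store; the *.added.nt.gz /
*.removed.nt.gz file names and the update endpoint URL are assumptions for
illustration, not the actual DBpedia-Live layout):

import gzip
import requests

UPDATE_ENDPOINT = "http://localhost:8890/sparql"   # assumed local SPARQL Update endpoint

def sparql_update(query):
    requests.post(UPDATE_ENDPOINT, data={"update": query}).raise_for_status()

def apply_changeset(added_path, removed_path):
    # Replay one changeset: delete the removed triples, then insert the added ones.
    with gzip.open(removed_path, "rt", encoding="utf-8") as f:
        removed = f.read()
    with gzip.open(added_path, "rt", encoding="utf-8") as f:
        added = f.read()
    if removed.strip():
        sparql_update("DELETE DATA { %s }" % removed)
    if added.strip():
        sparql_update("INSERT DATA { %s }" % added)

apply_changeset("000001.added.nt.gz", "000001.removed.nt.gz")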


Other useful links:
- http://www.openarchives.org/rs/0.5/resourcesync
- http://www.sdshare.org/
- http://www.w3.org/community/sdshare/
- http://www.rabbitmq.com/


All the best,
Sebastian

[1] 
https://www.wikidata.org/w/index.php?title=Special:RecentChanges&feed=atom


On 26.04.2013 03:15, Hady elsahar wrote:

Hello Dimitris,

What do you think of that?
Shall I write this part as an abstract part in the proposal and wait
for more details, or could we have a similar plan to the one already
implemented in DBpedia? http://wiki.dbpedia.org/DBpediaLive#h156-3


thanks
regards


On Fri, Apr 26, 2013 at 12:50 AM, Jeremy Baron jer...@tuxmachine.com wrote:


On Thu, Apr 25, 2013 at 10:42 PM, Hady elsahar
hadyelsa...@gmail.com wrote:
 2- Is there any design pattern or a brief outline for the
change propagation design, and how would it look? So that I
could make a rough plan and estimate of how it could be
consumed from the DBpedia side?

I don't know anything about the plan for this but it seems at first
glance like a good place to use [[w:PubSubHubbub]].

-Jeremy

___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l




--
-
Hady El-Sahar
Research Assistant
Center of Informatics Sciences | Nile University 
http://nileuniversity.edu.eg/


email : hadyelsa...@gmail.com
Phone : +2-01220887311
http://hadyelsahar.me/

http://www.linkedin.com/in/hadyelsahar



___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l



--
Dipl. Inf. Sebastian Hellmann
Department of Computer Science, University of Leipzig
Projects: http://nlp2rdf.org , http://linguistics.okfn.org , 
http://dbpedia.org/Wiktionary , http://dbpedia.org

Homepage: http://bis.informatik.uni-leipzig.de/SebastianHellmann
Research Group: http://aksw.org
___
Wikidata-l mailing list
Wikidata-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-l