On Friday, 15 August 2008, Daniel Friesen wrote:
> Sub parsers. In what kind of case does this kind of thing happen for you?

Normally, sub-parses happen with (a clone of) the current parser, e.g. when 
using a <gallery>. But I am not aware of any guideline that forbids 
extensions from creating or cloning new parser objects and using them with 
any title they like. So anything could happen.
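
To illustrate: nothing stops an extension from doing something like the 
following (a hypothetical sketch, not actual code from any extension), and 
every parser hook then fires again for the inner parse, with a title that is 
unrelated to the page the outer code is processing:

  // Spawn a fresh parser and run it against an arbitrary title.
  function someExtensionRender( $text ) {
      global $wgUser;
      $parser  = new Parser();
      $options = ParserOptions::newFromUser( $wgUser );
      $title   = Title::newFromText( 'Some unrelated page' );
      // All parser hooks fired during this call see $title, not the
      // title of whatever page triggered someExtensionRender().
      $output  = $parser->parse( $text, $title, $options );
      return $output->getText();
  }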

>
> When one thing is being parsed, there is one parser doing that task. I
> don't know of many cases where multiple parsers exist (unless an
> extension is doing something screwy).

We have observed the use of multiple parsers, or of one parser with multiple 
title objects (the distinction is not really relevant for us), in between SMW 
calls on various wikis. We use hooks during parsing to set the title of the 
page that is currently processed, so we notice when titles change and know 
that we have to reset our data (in a long PHP run, many titles may be 
processed, and there is no guarantee that some save-hook is called before the 
next page starts processing).
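
In sketch form (simplified, with hypothetical function and variable names), 
each of our hook handlers starts out like this:

  // Every parser hook first syncs the "current" title before touching
  // any buffered data; a title change means that another parser (or
  // another page) has taken over and the buffer must be reset.
  function smwfExampleParserHook( &$parser, &$text ) {
      static $currentTitle = null;
      static $buffer = array();

      $title = $parser->getTitle();
      if ( $currentTitle === null || !$currentTitle->equals( $title ) ) {
          // Whatever we buffered belongs to a title we may never see again.
          $currentTitle = $title;
          $buffer = array();
      }
      // ... collect annotations for $currentTitle into $buffer ...
      return true;
  }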

Initially, in 1.2, we reset the data and title just once during parsing, and 
not all hooks set the title again. This led to nasty bugs where data was 
stored for the wrong title (in one case we even had annotated special 
pages!). Since the title for storing was only ever set within hooks of the 
parser (using getTitle() of the supplied parser), the only explanation is 
that some other parser fired those hooks while parsing a different title 
object, and that this happened before we saved the current data to the DB.

Now we make sure that each and every hook call first sets the proper current 
title and only then saves data of any kind. This at least ensures that no 
data ever ends up under the wrong title, but data can still be lost. Again it 
happened that titles changed between parsing and storing (leading to loss of 
data, since the change of title also led to clearing the internal data 
buffer). So we now use a second buffer that keeps the data already parsed for 
the *previous* title, just in case it turns out that the next saving method 
actually wants to save this data! But this is just a hack: we are blindly 
moving from hook to hook, parsing data here and there, without knowing for 
which cases there will be a save-hook later on. It is all very frustrating, 
and race conditions are still possible.
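
A sketch of this two-buffer workaround (again with hypothetical names): when 
the title changes, the parsed data is shelved rather than discarded, so that 
a save-hook arriving late can still find it:

  class ExampleDataBuffer {
      private static $title = null;      // title currently being parsed
      private static $data = array();    // data collected for $title
      private static $prevTitle = null;  // the title before the last switch
      private static $prevData = array();

      public static function setTitle( Title $newTitle ) {
          if ( self::$title !== null && !self::$title->equals( $newTitle ) ) {
              self::$prevTitle = self::$title; // shelve instead of discarding
              self::$prevData  = self::$data;
              self::$data = array();
          }
          self::$title = $newTitle;
      }

      public static function getDataFor( Title $title ) {
          if ( self::$title !== null && self::$title->equals( $title ) ) {
              return self::$data;
          }
          if ( self::$prevTitle !== null && self::$prevTitle->equals( $title ) ) {
              return self::$prevData; // the save-hook came after a switch
          }
          return array(); // the race described above: data already lost
      }
  }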

Even now we still experience cases where apparently random data is lost when 
we create update jobs for all pages: some pages just lose their properties, 
but they are different pages each time we try. And of course this affects at 
most 10 pages each time on a densely annotated wiki with 7000 articles 
(semanticweb.org).

Following your report this morning, I have also removed the setting of the 
title in ParserBeforeStrip. Maybe this reduces the number of wrongly set 
titles.

>
> Have you tried making use of the ParserOutput? That seems like a
> to-the-point thing; there should only be one of those for a parse.

I have not yet found a way to use it properly. Can it hold additional data 
somewhere?
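
The only idea I could come up with (an untested assumption on my part, not 
something I found documented): PHP objects accept ad-hoc members, and each 
ParserOutput belongs to exactly one parse, so an extension might piggy-back 
its data on the output itself, e.g.:

  // mExampleData is a made-up member name; whether such extra members
  // survive in all cases (e.g. the parser cache) is exactly what I am
  // unsure about.
  function smwfCollectIntoOutput( &$parser, &$text ) {
      $output = $parser->mOutput; // the ParserOutput of this very parse
      if ( !isset( $output->mExampleData ) ) {
          $output->mExampleData = array();
      }
      $output->mExampleData[] = 'some parsed annotation';
      return true;
  }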

Not only the semantic data but also other "globals" are affected by similar 
problems. We use globals to add CSS and JavaScript to pages, based on whether 
they are needed on a given page. It turned out that, when viewing a special 
page, jobs are executed between the time the page is parsed and the time the 
output HTML is created. Hence any job would actually have to capture the 
current globals and restore them after doing any parsing, or otherwise the 
job's parsers will "use up" the script data needed by the special page. 
Again, one could add further protection to make sure scripts are only 
"consumed" by the page that created them, but these are all just workarounds 
for the basic problem: if you need to preserve data between hooks, how can 
you make sure that the data is not stored forever, yet remains available 
long enough until you need it?
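
The guard a job would need looks roughly like this (a sketch; the global 
name is made up):

  // Capture the script/style globals before the job parses anything,
  // and put the captured values back afterwards, so that the special
  // page that is still being rendered gets its scripts after all.
  function runJobWithGlobalGuard( $job ) {
      global $wgExampleHeadScripts; // hypothetical: filled while parsing,
                                    // consumed when the HTML is assembled
      $saved = $wgExampleHeadScripts;
      $job->run(); // may parse other pages, firing all our hooks again
      $wgExampleHeadScripts = $saved;
  }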

-- Markus

>
> ~Daniel Friesen(Dantman, Nadir-Seen-Fire) of:
> -The Nadir-Point Group (http://nadir-point.com)
> --It's Wiki-Tools subgroup (http://wiki-tools.com)
> --The ElectronicMe project (http://electronic-me.org)
> --Games-G.P.S. (http://ggps.org)
> -And Wikia ACG on Wikia.com (http://wikia.com/wiki/Wikia_ACG)
> --Animepedia (http://anime.wikia.com)
> --Narutopedia (http://naruto.wikia.com)
>
> Markus Krötzsch wrote:
> > Hi Daniel,
> >
> > it's always refreshing to get some thorough code critique from you in the
> > morning -- thanks for caring! I have added you to our contributors' list,
> > and I would much appreciate your ideas on some further hacks that I am
> > well aware of, see below.
> >
> >> Anyone want to explain to me why the ParserBeforeStrip hook is being
> >> used to register parser functions?
> >
> > In defence of my code: it works. Up to the introduction of
> > ParserFirstCallInit, it was also one of the few hooks that got reliably
> > called (at least in my experience) before any parser function would be
> > needed.
> >
> >> That is a poor place for it, as well as unreliable. Which I can see by
> >> how the function being called is a major hack relying on the first call
> >> returning the callback name when already set...
> >
> > Well, I have seen worse hacks (only part of which were in my code, but
> > see the remarks below on a major problem I still see there). But point
> > taken for this issue too.
> >
> >> Since I took the liberty of fixing up Semantic Forms, please see it as a
> >> reference on how to correctly add Parser Functions to the parser:
> >> http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/SemanticForms/includes/SF_ParserFunctions.php?view=markup
> >
> > Great, I added similar code to SMW now.
> >
> >
> > To stay with this topic, I feel that the whole parser hooking business is
> > bound to be one large hack. As a parser extension that stores data, you
> > need to hook to several places in MW, hoping that they are somehow called
> > in the expected order and that nobody overwrites your data in between
> > hooks. We have to store the parsed data somewhere, and this place needs
> > to be globally accessible since the parser offers no local storage to us
> > (none that would not be cloned with unrelated subparsers anyway). But
> > parsing is not global and happens in many parsers, or in many, possibly
> > nested, runs of one parser. The current code has evolved to prevent many
> > problems that this creates, but it lacks a unified approach towards
> > handling this situation.
> >
> > Many things can still go wrong. There is no way of finding out whether we
> > run in the main parsing method of a wiki page text, or if we are just
> > called on some page footer or sub-parsing action triggered by some
> > extension. Jobs and extensions cross-fire with their own parsing calls,
> > often using different Title objects.
> >
> > Do you have any insights on how to improve the runtime data management in
> > SMW so that we can collect data belonging to one article in multiple
> > hooks, not have it overwritten by other sub-hooks, and still do not get
> > memory leaks on very long runs? We cannot keep all data indefinitely just
> > because we are unsure whether we are still in a sub-parser and need the
> > data later on. But if we only store the *current* data, we need to find
> > out what title actually is currently parsed with the goal of storing or
> > updating its data in the DB.
> >
> >
> > Best regards,
> >
> > Markus
> >



-- 
Markus Krötzsch
Semantic MediaWiki    http://semantic-mediawiki.org
http://korrekt.org    [EMAIL PROTECTED]
