Markus, a simple solution to the problem could be adding an id range
parameter to SMW_refreshData - not only the first id to start with, but a
last one as well. That way it would be easy to split the whole task into
smaller chunks to avoid the memory leaks and (possibly) to run them in
parallel. I was thinking about writing a patch like that, but I'm not sure
when I'll get back to this problem.
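
To make the idea concrete, the core of such a patch could look roughly like
the sketch below. It is untested, and the -e option name as well as the
smwfRefreshPage() call are only placeholders for whatever SMW_refreshData
actually uses internally per page id:

    // Untested sketch, inside the maintenance script: accept an upper bound
    // (-e) next to the start id (-s), so one run only touches ids in [start, end].
    $options = getopt( 's:e:' );
    $start = isset( $options['s'] ) ? intval( $options['s'] ) : 1;
    $end   = isset( $options['e'] ) ? intval( $options['e'] ) : PHP_INT_MAX;

    for ( $id = $start; $id <= $end; $id++ ) {
        // Placeholder for the real per-page work: re-parse the page and
        // rewrite its semantic data in the store.
        smwfRefreshPage( $id );
    }

Separate runs could then be started for 1-50000, 50001-100000, and so on,
either one after the other or on different CPUs.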

P.S. The MediaWiki I'm running this on is from the latest branch, i.e. 1.11.x

           Sergey


On Nov 24, 2007 8:30 AM, Markus Krötzsch <[EMAIL PROTECTED]> wrote:

> On Dienstag, 6. November 2007, Sergey Chernyshev wrote:
> > It seems that SMW_refreshData gets slower with the growing size of the
> > dataset.
> >
> > I didn't do much troubleshooting of the issue, but the first 50000 pages
> > from my dataset were processed faster than the second 50000 pages.
>
> I noticed the same on our servers, and I suspect some memory leak accounts
> for that. It is possible that MediaWiki is part of the reason -- we had a
> similar problem some time ago and it turned out that MediaWiki's link-cache
> had no size limit (so batch-processing 1 million pages really generated a
> large array in memory). Similar caches may be the reason for the renewed
> slowdown, but we were unable to analyse this issue in detail. Anyway, the
> MW version is an important part of debugging here.
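
If the link cache is indeed the culprit, clearing it periodically from the
refresh loop might already help. A rough sketch of that, assuming
LinkCache::singleton() is available on this branch, and with $pageIds and
refreshOnePage() as placeholders for whatever the script really iterates
over and calls:

    // Rough sketch: flush MediaWiki's link cache every N pages inside a long
    // batch loop, so the per-title cache cannot grow without bound.
    $batchSize = 500;
    $count = 0;
    foreach ( $pageIds as $id ) {  // placeholder: the id list the refresh walks
        refreshOnePage( $id );     // placeholder: the actual per-page refresh
        if ( ++$count % $batchSize === 0 ) {
            LinkCache::singleton()->clear();  // drop cached title/link entries
        }
    }
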
>
> >
> > I'm going to start the upgrade over with RC2 and will try to look at the
> > speed of the process, but I think the reason might be some indexes getting
> > bigger with more data (which could be avoided by dropping the indexes
> > prior to the refresh and rebuilding them right after), or MySQL not liking
> > that many temporary tables being created so rapidly.
>
> I would rather suspect the PHP side to be the reason, but one never knows.
> I do not expect changes between the SMW 1.0 RCs. Basically, the refresh
> process did not change much for a long time, but the speed issues only
> occurred recently (again suggesting that some change in MW may be the
> reason). SMW also has some unbound caches, but these are for properties and
> should hardly get large enough on current wikis to be relevant here.
>
> >
> > Also, I'm wondering whether parts of the dataset can be processed in
> > parallel? It seems that a single run of the script doesn't load the CPU
> > that much and alternates between the PHP and MySQL processes, which is
> > not optimal for multi-processor boxes where these loads could be spread
> > across all the CPUs.
>
> Possibly, but refreshing is often a low-priority task, since the wiki
> should remain usable during refreshing. So it might be an advantage if it
> works in the background without eating too many resources at a time (which,
> by the above observation, is probably not really the case either ;-).
>
>
> Markus
>
> >
> >          Sergey
>
>
>
> --
> Markus Krötzsch
> Institut AIFB, Universität Karlsruhe (TH), 76128 Karlsruhe
> phone +49 (0)721 608 7362        fax +49 (0)721 608 5998
> [EMAIL PROTECTED]        www  http://korrekt.org
>
