On Monday, 17 December 2007, Sergey Chernyshev wrote:
> Thank you, Markus - it's a really good review! I wonder if there is any way
> to unify performance reporting for all SMW instances so we can compare the
> effects of large data sets, different system configs (e.g. disabled cache
> and so on) - I just looked at the profileinfo.php script; it might actually
> be the answer.
>
> I wonder if a real Wikipedia data set (maybe an outdated one) is going to be
> set up as a test case for SMW to handle (with Semantic Templates, of course)
> - I was going to do that, but don't have the resources for it. This might
> help to make the goal of "Semantic Wikipedia" more transparent.

In fact we have such a site, but it runs on a rather unstable hardware (we 
have a buggy RAID controller or driver :-(). It is our test server at 
test.ontoworld.org, which also was used for other experiments and is not in 
perfect shape right now (and querying was disabled in order to not impair 
other experiments). We might set up another, more recent Wikipedia copy at 
some point in the future.

>
> Since we're talking about performance, there is another side of performance
> tuning - perceived performance. This mostly concerns JavaScript, CSS and
> so on - for example, there is still the problem of the SIMILE Timeline not
> loading quickly (although performance of pages that don't use it has
> improved, now that the client-side code is loaded only on pages that need
> it). These kinds of issues can be tracked using Firebug with Yahoo's YSlow
> add-on.

True, and I hope Timeline is really the main performance problem there. I 
wonder whether we could ship a more stripped-down version of the scripts to 
decrease load time. I guess we should ask the guys over at CSAIL for that ...

>
> I'll be happy to run the tests on the system with significant amount of
> data if you need a testbed.

All profiling support is appreciated, but I am not sure how to operationalise 
testing on our servers (SQL profiling would probably need server access, 
which is not possible in this case). Insights on JavaScript performance are 
also useful, but I guess that MySQL tuning could be most important for 
approaching large sites. If you know about DB optimisation, you can also 
have a look at our DB layout and at the SQL queries we generate 
(format=debug).
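As an illustration of the debug format mentioned above, using the `<ask>` tag syntax that appears later in this thread (the category name is a made-up example, not from this discussion), an inline query whose generated SQL you want to inspect might look like:

```
<ask format="debug">[[Category:City]]</ask>
```

Instead of formatted results, such a query should print internal information, including the SQL it would run, which is what makes it useful for DB optimisation work.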

Thanks,

Markus

>
> On Dec 16, 2007 8:56 AM, Markus Krötzsch <[EMAIL PROTECTED]> wrote:
> > On Friday, 14 December 2007, Sergey Chernyshev wrote:
> > > Got it - if it'll speed up the process, that'll be great. Currently SMW
> > > on top of MW runs significantly slower than just MW, which is not very
> > > good because it means that SMW+MW can't scale as well as MW alone.
> > >
> > > Can you describe in a couple of paragraphs how SMW data and queries are
> > > cached and how that cache is invalidated - what works on the fly and
> > > what is served from the parser cache?
> > >
> > > I understand it's a lot to describe, but for projects with massive
> > > amounts of data and traffic, performance can be a big show-stopper - we
> > > picked MW for one of our projects because of Wikipedia's example of
> > > performance and predictability, and I hope that it's not too distant for
> > > SMW to inherit these qualities, but I'd like to understand the overall
> > > picture.
> >
> > Yes, agreed. Of course we have always designed basic algorithms with
> > regard to performance and scalability, and especially tried to pick
> > features based on this aspect. On the other hand, caching is significantly
> > under-developed in SMW as it is, since it mainly uses the existing MW
> > caches where applicable. There are various types of operations that are
> > relevant to performance, and each can probably be optimised/cached
> > independently:
> >
> > (1) Basic page display -- by far the most common operation.
> > (2) Query answering, inline and on Special:Ask
> > (3) Annotation parsing and page formatting.
> > (4) Maintenance specials such as Special:Properties.
> > (5) OWL/RDF export.
> > (6) Browsing via Special:Browse.
> >
> > I will sketch performance issues for each of those. For actual numbers,
> > see http://ontoworld.org/profileinfo.php to find out how severe each
> > operation is on ontoworld.org.
> >
> > (1) is clearly the main operation, and for existing pages SMW merely uses
> > MW's parser/page caches. No mechanism for cache invalidation exists, but
> > MW regularly updates page caches. This allows outdated inline queries, but
> > gives us good hope for basic scalability in large environments. In
> > particular, SMW does not hook into any operations that happen when
> > reproducing parser-cached pages. Even the Factbox comes from the parser
> > cache (which is why we cannot readily translate it to the user's language
> > as MW does for categories).
> >
> > (2) Query answering is done without any caching, and this is clearly a
> > problem. While inline queries are computed only once and stored in the
> > parser cache afterwards, Special:Ask has no caching facility at all. This
> > needs to change in the future. Targeted cache invalidation might still be
> > difficult, and it is not clear whether the effort is needed (one could
> > enable manual cache clearing as for pages). A new query cache -- design,
> > architecture and implementation -- is needed here.
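The query cache described here does not exist in SMW; as a minimal sketch of the design being proposed (names and structure are my own invention, not SMW code), a cache keyed on the query string with manual clearing instead of targeted invalidation could look like this:

```python
# Hedged sketch (not actual SMW code): a minimal query cache for something
# like Special:Ask, with manual clearing instead of targeted invalidation.
class QueryCache:
    def __init__(self):
        self._results = {}

    def get(self, query, compute):
        # Serve a stored result if we have one; otherwise compute and keep it.
        if query not in self._results:
            self._results[query] = compute(query)
        return self._results[query]

    def clear(self):
        # Manual purge, analogous to clearing a page's parser cache,
        # since invalidating only the affected queries is hard.
        self._results.clear()
```

The trade-off is exactly the one named above: results can go stale until someone clears the cache, but repeated identical queries stop hitting the database.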
> >
> > (3) Page formatting uses very few additional DB calls, and mainly works
> > on the wiki source code that was already retrieved anyway. It has no
> > major performance impact (see smwfParserHook in the profile).
> >
> > (4) Maintenance specials can be slow, but have been designed to allow the
> > caching mechanism that MW uses for its own maintenance specials. This is
> > not implemented yet, but it would be possible. One design decision, which
> > probably applies in more cases, is whether to have transparent caching in
> > the storage implementation, or whether to trigger caching explicitly in
> > the caller (which may help to keep the storage implementation from growing
> > even bigger than it is now).
> >
> > (5) OWL/RDF export takes time, but mostly depending on the export
> > settings of your site. The result could be cached internally in a similar
> > way to how page content is cached. External caches could be configured to
> > cache RDF as well. Yet this is not to be neglected, since a number of
> > Semantic Web crawlers and misguided RSS spiders regularly visit the RDF.
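For the external-cache option, a hedged sketch of what such a configuration might look like (the URL pattern and lifetime are assumptions for illustration; SMW's RDF export special page and your cache setup may differ):

```apache
# Let an external HTTP cache (e.g. Squid, or the browser) keep RDF export
# results for an hour, so crawlers revisiting the RDF hit the cache.
<LocationMatch "Special:ExportRDF">
    Header set Cache-Control "public, max-age=3600"
</LocationMatch>
```

This trades freshness of the exported RDF for protection against the repeated crawler visits mentioned above.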
> >
> > (6) Special:Browse is not inefficient, but as it is a specialised form
> > of "What links here" it also faces similar performance issues.
> >
> > Finally, SMW needs practically no time to load if it is not strictly
> > needed, so enabling it hardly slows down the wiki for services that need
> > no SMW.
> >
> >
> > Summing up, the required caching facilities in order of relevance would
> > probably be: (2) [Queries], (4) [Specials], (5) [OWL/RDF]. I do not think
> > that the other parts need too much care, but analysing the current
> > profileinfo may yield more insights. Concerning (2), which is by far the
> > most severe performance problem, we have included many ways of restricting
> > queries, so that large sites can always switch off features until things
> > work again (SMW is still useful without very complex queries). At the
> > moment this is the suggested procedure for large sites, and we can also
> > offer some support to help such sites avoid major problems (things of
> > course also depend a lot on the wiki's actual structure).
> >
> > Best regards,
> >
> > Markus
> >
> > > Thank you,
> > >
> > >               Sergey.
> > >
> > > On Dec 14, 2007 1:12 PM, Markus Krötzsch <[EMAIL PROTECTED]>
> >
> > wrote:
> > > > On Friday, 14 December 2007, Sergey Chernyshev wrote:
> > > > > Markus, can you elaborate on the three values - what's the
> > > > > difference between SOME and FULL?
> > > >
> > > > FULL is what used to be "true" in 1.0 (default)
> > > > NONE is what used to be "false" in all versions
> > > > SOME is new, but does basically what 0.7 did earlier.
> > > >
> > > > So SOME only considers redirects for pages that appear directly in
> > > > the query. For example, assume "r1" and "r2" are redirects to "p".
> > > > Then asking for "[[property::r1]]" yields the same results as asking
> > > > for "[[property::p]]" or "[[property::r2]]".
> > > >
> > > > This is not too hard to do. Now FULL evaluates redirects even when
> > > > joining subqueries or asking for categories. As an example, assume
> > > > that, in addition to the above, there is a page "q" with annotation
> > > > "[[property::r1]]", and assume further that r2 is in Category2 and
> > > > that p is in Category3. Then each of the following queries contains
> > > > "q" in its result list:
> > > >
> > > > * <ask>[[property::<q>[[Category:Category3]]</q>]]</ask>
> > > > * <ask>[[property::<q>[[Category:Category2]]</q>]]</ask>
> > > >
> > > > Neither would work with SOME only. But as you can imagine, doing
> > > > these additional considerations about redirects at query time
> > > > consumes a lot of additional time (in particular since we use MW's
> > > > redirect table, which is not even optimised for this kind of game).
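The SOME/FULL distinction above can be captured in a hedged toy model (the data structures are invented for illustration, not SMW's DB layout or code), using the r1/r2/p/q example from this message:

```python
# redirect page -> target, annotations, and category membership from the
# example: r1 and r2 redirect to p, q carries [[property::r1]],
# r2 is in Category2, p is in Category3.
redirects = {"r1": "p", "r2": "p"}
annotations = {"q": ("property", "r1")}
categories = {"r2": {"Category2"}, "p": {"Category3"}}

def resolve(page):
    # Follow the redirect table, as SOME does for pages named in the query.
    return redirects.get(page, page)

def query_property_in_category(category, mode):
    # Evaluate [[property::<q>[[Category:...]]</q>]] under SOME or FULL.
    members = {pg for pg, cats in categories.items() if category in cats}
    result = set()
    for page, (prop, target) in annotations.items():
        if prop != "property":
            continue
        if mode == "FULL":
            # FULL also resolves redirects for stored annotation values and
            # subquery results -- the extra joins that cost time at query time.
            if resolve(target) in {resolve(m) for m in members}:
                result.add(page)
        elif target in members:
            # SOME: no redirect resolution when joining a subquery.
            result.add(page)
    return result
```

Under this model, both Category2 and Category3 subqueries match "q" in FULL mode, while neither matches in SOME mode, exactly as described above.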
> > > >
> > > > If you make sure that properties do not point to redirects, and that
> > > > redirects have no categories or properties, then SOME should always
> > > > suffice (I think it was discussed earlier to have a special page for
> > > > that kind of maintenance).
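Part of that maintenance check can be sketched against MediaWiki's core table layout (`page` and `categorylinks`; the properties half of the check would need SMW's own tables, so it is omitted here). This is a hedged illustration with toy data, not a ready-made maintenance script:

```python
import sqlite3

# Miniature stand-ins for MW's page and categorylinks tables, populated with
# the r1/r2/p example from this thread: r2 is a redirect that carries a
# category, which the advice above says should be cleaned up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE page (page_id INTEGER, page_title TEXT,
                       page_is_redirect INTEGER);
    CREATE TABLE categorylinks (cl_from INTEGER, cl_to TEXT);
    INSERT INTO page VALUES (1, 'p', 0), (2, 'r1', 1), (3, 'r2', 1);
    INSERT INTO categorylinks VALUES (3, 'Category2'), (1, 'Category3');
""")

def categorised_redirects(conn):
    # Redirect pages that also carry categories: candidates for cleanup
    # before relying on the SOME setting.
    return conn.execute("""
        SELECT p.page_title, cl.cl_to
        FROM page AS p
        JOIN categorylinks AS cl ON cl.cl_from = p.page_id
        WHERE p.page_is_redirect = 1
    """).fetchall()
```

A special page for this maintenance, as discussed, would essentially run such a query and list the offending redirects.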
> > > >
> > > > -- Markus
> > > >
> > > > >            Sergey



-- 
Markus Krötzsch
Institut AIFB, Universität Karlsruhe (TH), 76128 Karlsruhe
phone +49 (0)721 608 7362        fax +49 (0)721 608 5998
[EMAIL PROTECTED]        www  http://korrekt.org


_______________________________________________
Semediawiki-devel mailing list
Semediawiki-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/semediawiki-devel
