Re: [Wikitech-l] Current events-related overloads

2009-06-26 Thread Domas Mituzas

> This is a very good idea, and sounds much better than having those

The major problem with all dirty caching is that we have more than one  
caching layer, and of course, things abort.

The fact that people would be shown dirty versions instead of the proper  
article leads to a situation where, in cases like vandal fighting, people  
will see stale versions instead of waiting a few seconds and getting the  
real one.

In theory, update flow could look like this:

1. Set "I'm working on this" in a parallelism coordinator or lock  
manager
2. Do all database transactions & commit
3. Parse
4. Set memcached object
5. Invalidate squid objects

Now, whether we parse, block, or serve stale could be dynamic: e.g. if  
we detect more than x parallel parses we fall back to blocking for a few  
seconds, and once we detect more than y threads blocked on the task, or  
the block expires and there's no fresh content yet (or there's a new  
copy..) - then the stale copy can be served.
In a perfect world this asks for specialized software :)
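
As a very rough illustration, that flow could look something like the  
sketch below. This is only a sketch: $memc is a PHP Memcached client,  
while $db, $parser and $squid are hypothetical stand-ins, not any actual  
MediaWiki API.

function saveAndRender( $title, $text, $memc, $db, $parser, $squid ) {
    $lockKey = "parse-lock:$title";

    // 1. Announce "I'm working on this". Memcached::add() is atomic, so
    //    only one writer gets the lock; the expiry keeps a dead process
    //    from holding it forever.
    if ( !$memc->add( $lockKey, getmypid(), 120 ) ) {
        return false; // someone else is already updating this article
    }

    // 2. Do all database transactions & commit.
    $db->begin();
    $db->update( $title, $text );
    $db->commit();

    // 3. Parse the new wikitext.
    $html = $parser->parse( $text );

    // 4. Set the memcached (parser cache) object.
    $memc->set( "pcache:$title", $html, 86400 );

    // 5. Invalidate the squid objects so readers fetch the fresh copy.
    $squid->purge( $title );

    $memc->delete( $lockKey );
    return true;
}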

Do note, for the past quite a few years we have done lots and lots of  
work to avoid stale content being served. I would not see dirty serving  
as something we should be proud of ;-)

Domas



Re: [Wikitech-l] Current events-related overloads

2009-06-26 Thread Roan Kattouw
2009/6/26 Aryeh Gregor:
> But this sounds like a good idea.  If a process is already parsing the
> page, why don't we just have other processes display an old cached
> version of the page instead of waiting or trying to reparse
> themselves?  The worst that would happen is some users would get old
> views for a couple of minutes.
>
This is a very good idea, and sounds much better than having those
other processes wait for the first process to finish parsing. It would
also reduce the severity of the deadlocks that occur when a process
gets stuck on a parse or dies in the middle of it: instead of
deadlocking, the other processes would just display stale versions
rather than wasting time. If we design these parser cache locks to
expire after a few minutes or so, it should work just fine.
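
A minimal sketch of that read path, assuming a memcached-style lock with
a short expiry ($memc, $parser and the key names here are illustrative
assumptions, not existing MediaWiki code):

function getRenderedPage( $title, $revId, $memc, $parser ) {
    $cached = $memc->get( "pcache:$title" );
    if ( is_array( $cached ) && $cached['revId'] === $revId ) {
        return $cached['html']; // fresh parser cache hit
    }

    // Try to become the single renderer. The lock's TTL means a process
    // that gets stuck or dies can only hold it for a few minutes.
    if ( $memc->add( "parse-lock:$title", getmypid(), 180 ) ) {
        $html = $parser->parse( $title, $revId );
        $memc->set( "pcache:$title", array( 'revId' => $revId, 'html' => $html ) );
        $memc->delete( "parse-lock:$title" );
        return $html;
    }

    // Someone else is already parsing: serve the stale copy if we have
    // one, rather than piling another parse onto the cluster.
    if ( is_array( $cached ) ) {
        return $cached['html'];
    }

    // No cached copy at all, stale or otherwise -- parse it ourselves.
    return $parser->parse( $title, $revId );
}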

Roan Kattouw (Catrope)



Re: [Wikitech-l] Current events-related overloads

2009-06-26 Thread Aryeh Gregor
On Fri, Jun 26, 2009 at 6:33 AM, Thomas Dalton wrote:
> Of course, the fact that everyone's first port of call after hearing
> such news is to check the Wikipedia page is a fantastic thing, so it
> would be really unfortunate if we have to stop people doing that.

He didn't say we'd shut down views for the article, just that we'd shut
down reparsing or cache invalidation or something.  This is the live
hack that was applied yesterday:

Index: includes/parser/ParserCache.php
===================================================================
--- includes/parser/ParserCache.php (revision 52359)
+++ includes/parser/ParserCache.php (working copy)
@@ -63,6 +63,7 @@
         if ( is_object( $value ) ) {
             wfDebug( "Found.\n" );
             # Delete if article has changed since the cache was made
+            if( $article->mTitle->getPrefixedText() != 'Michael Jackson' ) { // temp hack!
             $canCache = $article->checkTouched();
             $cacheTime = $value->getCacheTime();
             $touched = $article->mTouched;
@@ -82,6 +83,7 @@
             }
             wfIncrStats( "pcache_hit" );
         }
+            } // temp hack!
         } else {
             wfDebug( "Parser cache miss.\n" );
             wfIncrStats( "pcache_miss_absent" );

It just meant that people were seeing outdated versions of the article.

> Would it be possible, perhaps, to direct all requests for a certain
> page through one server so the rest can continue to serve the rest of
> the site unaffected?

Every page view involves a number of servers, and they're not all
interchangeable, so this doesn't make a lot of sense.

> Or perhaps excessively popular pages could be
> rendered (for anons) as part of the editing process, rather than the
> viewing process, since that would mean each version of the article is
> rendered only once (for anons) and would just slow down editing
> slightly (presumably by a fraction of a second), which we can live
> with.

You think that parsing a large page takes a fraction of a second?  Try
twenty or thirty seconds.

But this sounds like a good idea.  If a process is already parsing the
page, why don't we just have other processes display an old cached
version of the page instead of waiting or trying to reparse
themselves?  The worst that would happen is some users would get old
views for a couple of minutes.



Re: [Wikitech-l] Current events-related overloads

2009-06-26 Thread Thomas Dalton
2009/6/26 Brion Vibber:
> Tim Starling wrote:
>> It's quite a complex feature. If you have a server that deadlocks or
>> is otherwise extremely slow, then it will block rendering for all
>> other attempts, meaning that the article can not be viewed at all.
>> That scenario could even lead to site-wide downtime, since threads
>> waiting for the locks could consume all available apache threads, or
>> all available DB connections.
>>
>> It's a reasonable idea, but implementing it would require a careful
>> design, and possibly some other concepts like per-article thread count
>> limits.
>
> *nod* We should definitely ponder the issue since it comes up
> intermittently but regularly with big news events like this. At the
> least if we can have some automatic threshold that temporarily disables
> or reduces hits on stampeded pages that'd be spiffy...

Of course, the fact that everyone's first port of call after hearing
such news is to check the Wikipedia page is a fantastic thing, so it
would be really unfortunate if we have to stop people doing that.
Would it be possible, perhaps, to direct all requests for a certain
page through one server so the rest can continue to serve the rest of
the site unaffected? Or perhaps excessively popular pages could be
rendered (for anons) as part of the editing process, rather than the
viewing process, since that would mean each version of the article is
rendered only once (for anons) and would just slow down editing
slightly (presumably by a fraction of a second), which we can live
with. There must be something we can do that allows people to continue
viewing the page wherever possible.



Re: [Wikitech-l] Current events-related overloads

2009-06-25 Thread Brion Vibber
Tim Starling wrote:
> It's quite a complex feature. If you have a server that deadlocks or
> is otherwise extremely slow, then it will block rendering for all
> other attempts, meaning that the article can not be viewed at all.
> That scenario could even lead to site-wide downtime, since threads
> waiting for the locks could consume all available apache threads, or
> all available DB connections.
> 
> It's a reasonable idea, but implementing it would require a careful
> design, and possibly some other concepts like per-article thread count
> limits.

*nod* We should definitely ponder the issue since it comes up 
intermittently but regularly with big news events like this. At the 
least if we can have some automatic threshold that temporarily disables 
or reduces hits on stampeded pages that'd be spiffy...

-- brion



Re: [Wikitech-l] Current events-related overloads

2009-06-25 Thread Tim Starling
Aryeh Gregor wrote:
> On Thu, Jun 25, 2009 at 8:14 PM, Domas Mituzas wrote:
>> The problem is quite simple, lots of people (like, million pageviews
>> on an article in an hour) caused a cache stampede (all pageviews
>> between invalidation and re-rendering needed parsing), and as MJ
>> article is quite cite-heavy (and cite problems were outlined in 
>> http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/41547
>>  ;) the reparsing was very very painful on our application cluster -
>> all apache children eventually ended up doing lots of parsing work and
>> consuming connection slots to pretty much everything :)
> 
> So if two page views are trying to view the same uncached page at the
> same time with the same settings, the later ones should all block on
> the first one's reparsing instead of doing it themselves.  It should
> provide faster service for big articles too, even ignoring load, since
> the earlier parse will be done before you could finish yours anyway.

It's quite a complex feature. If you have a server that deadlocks or
is otherwise extremely slow, then it will block rendering for all
other attempts, meaning that the article can not be viewed at all.
That scenario could even lead to site-wide downtime, since threads
waiting for the locks could consume all available apache threads, or
all available DB connections.

It's a reasonable idea, but implementing it would require a careful
design, and possibly some other concepts like per-article thread count
limits.
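
One possible shape for such a per-article thread count limit, sketched
with memcached counters (the constant, key names and $memc are
illustrative assumptions, not existing code):

define( 'MAX_PARALLEL_PARSES', 3 );

function tryAcquireParseSlot( $title, $memc ) {
    $key = "parse-count:$title";
    // Make sure the counter exists; add() is a no-op if it already does.
    $memc->add( $key, 0, 300 );
    $count = $memc->increment( $key );
    if ( $count !== false && $count <= MAX_PARALLEL_PARSES ) {
        return true; // this thread may go ahead and parse
    }
    // Too many parses already running for this article: back off, and
    // let the caller block briefly or serve a stale copy instead.
    $memc->decrement( $key );
    return false;
}

function releaseParseSlot( $title, $memc ) {
    $memc->decrement( "parse-count:$title" );
}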

-- Tim Starling




Re: [Wikitech-l] Current events-related overloads

2009-06-25 Thread Aryeh Gregor
On Thu, Jun 25, 2009 at 8:14 PM, Domas Mituzas wrote:
> The problem is quite simple, lots of people (like, million pageviews
> on an article in an hour) caused a cache stampede (all pageviews
> between invalidation and re-rendering needed parsing), and as MJ
> article is quite cite-heavy (and cite problems were outlined in 
> http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/41547
>  ;) the reparsing was very very painful on our application cluster -
> all apache children eventually ended up doing lots of parsing work and
> consuming connection slots to pretty much everything :)

So if two page views are trying to view the same uncached page at the
same time with the same settings, the later ones should all block on
the first one's reparsing instead of doing it themselves.  It should
provide faster service for big articles too, even ignoring load, since
the earlier parse will be done before you could finish yours anyway.

That seems pretty easy to do.  You'd have some delays if everything
waited on a process that died or something, of course.
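
For illustration, the blocking side might look roughly like the sketch
below, assuming the same kind of hypothetical memcached lock and parser
cache keys as in the sketches above; the timeout is what guards against
waiting on a process that died:

function waitForParse( $title, $memc, $timeoutSeconds = 30 ) {
    $deadline = time() + $timeoutSeconds;
    while ( time() < $deadline ) {
        $cached = $memc->get( "pcache:$title" );
        if ( is_array( $cached ) ) {
            return $cached;  // the first process finished; reuse its output
        }
        if ( !$memc->get( "parse-lock:$title" ) ) {
            return null;     // nobody is parsing any more; caller should parse
        }
        usleep( 500000 );    // poll every half second
    }
    return null; // gave up waiting; caller falls back to parsing itself
}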


Re: [Wikitech-l] Current events-related overloads

2009-06-25 Thread Domas Mituzas
Hi!

> Just a quick note -- we're experiencing some fun load spikes due to
> heavy net usage of people searching for or talking about Michael
> Jackson's reported death or near-death.

The tech blog doesn't seem to get many updates.

The problem is quite simple: lots of people (like, a million pageviews  
on an article in an hour) caused a cache stampede (all pageviews  
between invalidation and re-rendering needed parsing), and as the MJ  
article is quite cite-heavy (and the cite problems were outlined in 
http://article.gmane.org/gmane.science.linguistics.wikipedia.technical/41547 
  ;) the reparsing was very, very painful on our application cluster -  
all apache children eventually ended up doing lots of parsing work and  
consuming connection slots to pretty much everything :)

Cheers,
Domas



[Wikitech-l] Current events-related overloads

2009-06-25 Thread Brion Vibber
Just a quick note -- we're experiencing some fun load spikes due to 
heavy net usage of people searching for or talking about Michael 
Jackson's reported death or near-death.

You may see some intermittent database connection failures on 
en.wikipedia.org for a little while as connections back up; we're poking 
to see if we can reduce this.

Updates at tech blog:
http://techblog.wikimedia.org/2009/06/current-events/

-- brion
