On 29 Feb 2012, at 01:13, Daevid Vincent wrote:

>> -----Original Message-----
>> From: Stuart Dallas [mailto:stu...@3ft9.com]
>> 
>> Seriously? Errors like this should not be getting anywhere near your
>> production servers. This is especially true if you're really getting 30k
>> hits/s.
> 
> Don't get me started. I joined here almost a year ago. They didn't even have
> an RCS or wiki at the time. Nothing was OOP. There was no PHPDoc, nor even
> comments in the code. They used to make each site by copying an existing one
> and modifying it (i.e. no shared libraries or resources). I could go on and
> on. Suffice it to say we've made HUGE leaps and bounds (thanks to me!), but
> there are only three of us developers here and no official test person, let
> alone a test team.
> 
> It is what it is. I'm doing the best I can with the limited resources
> available to me.

Good stuff, but the idea that you need an official test person or a test team 
to produce solid code that minimises runtime errors is, in my opinion, 
completely the wrong attitude. I've been in similar situations several times, 
and I've found that the key to taming a large codebase with minimal testing is 
not to try to solve the problem in one big push, but simply to start somewhere.

Put in the infrastructure for unit testing, then make writing tests a part of 
your standard development process. Over time you will find that you are unit 
testing the majority of the code. Yes, that will make things take longer, but 
you can also be confident that when you fix a bug it stays fixed, because you 
know there's a unit test that verifies that the bug has not returned.
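
To make this concrete, here's the shape of a regression test in PHPUnit (the 
slugify() function and bug number are invented for illustration; the point is 
that the test pins the fix down for good):

<?php
// Regression-test sketch (PHPUnit 3.x era). slugify() stands in for
// whatever function the bug lived in; run this with the `phpunit` CLI.
function slugify($title)
{
    if (!is_string($title)) {
        throw new InvalidArgumentException('slugify() expects a string');
    }
    return strtolower(trim(preg_replace('/[^a-z0-9]+/i', '-', $title), '-'));
}

class SlugifyTest extends PHPUnit_Framework_TestCase
{
    // Bug #1234 (invented): punctuation-only titles used to emit an
    // E_NOTICE and return null. This test keeps that bug fixed.
    public function testPunctuationOnlyTitleGivesEmptySlug()
    {
        $this->assertSame('', slugify('!!!'));
    }

    // Invalid input should be rejected explicitly, not logged as a warning.
    public function testRejectsNonStringInput()
    {
        $this->setExpectedException('InvalidArgumentException');
        slugify(array());
    }
}

Run that as part of every build and the bug can never quietly come back.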

In my experience and opinion, limited resources are a big reason to implement 
some level of unit testing as soon as humanly possible, not a reason why you 
can't. Once you have the unit testing infrastructure in place, make running the 
tests the first step in your deployment process. You mention that you now use a 
version control system; consider adding a hook that requires the unit tests to 
pass before code can be committed. Alternatively, implement a CI environment 
which publicly ridicules anyone who checks in code that breaks the unit tests; 
it's amazing how much of a motivator this can be, even in a small team of 
seasoned professionals.
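
As a sketch of the hook idea, assuming Git, a phpunit binary on the PATH and 
tests under tests/ (adjust for your VCS and layout):

#!/usr/bin/env php
<?php
// .git/hooks/pre-commit (sketch): refuse the commit if the tests fail.
passthru('phpunit tests/', $status);
if ($status !== 0) {
    fwrite(STDERR, "Unit tests failed; commit aborted.\n");
    exit(1); // a non-zero exit status makes git abort the commit
}
exit(0);

Drop that in .git/hooks/pre-commit, make it executable, and nobody can commit 
code that breaks the tests.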

> And let me tell you a little secret, when you get to that scale, you see all
> kinds of errors you don't see on your VM or even with a test team. DB
> connections go away. Funny things happen to memcache. Concurrency issues
> arise. Web bots and search engines rape, pillage and ravage your site in
> ways that make you feel dirty. So yeah, you do hit weird situations and
> cases you can't possibly test for, but that show up in error logs.

Not a secret. Not even close to being a secret. I'm no stranger to sites with 
the sort of traffic you have, and then some, and I'm fully aware that it 
presents a unique set of challenges, but there are simple steps you can take to 
make life easier.

Most of the specific issues you mention (DB connections, memcache weirdness, 
concurrency problems, and crawler activity) are the result of poor architecture 
and/or flawed server configuration. Where the architecture is poor, I'd 
recommend you design a new one and find a way to start moving towards it, piece 
by piece, without too much impact on your day-to-day work. This is not always 
easy, but I've done it several times and know it can be done in most 
situations.

There will always be issues that crop up that you couldn't possibly have known 
would happen, but you can load test your app to see what happens at traffic 
levels an order of magnitude above what you expect. One of my current clients 
has a test tool that can generate traffic levels that hit the limit of EC2 
network throughput, specifically to see what would happen if they had a sudden 
and dramatic increase in usage. Knowing the application can cope at 10x the 
expected level of traffic is the only way to be sure it can handle 1x without 
breaking a sweat.
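
You don't need a fancy tool to get started, either; even a crude script firing 
concurrent requests will tell you something. A throwaway sketch, with the URL 
and concurrency level as placeholders:

<?php
// Crude load-generation sketch using curl_multi: fire $concurrency
// simultaneous requests at $url and count the failures.
// Point it at staging, not at your live site!
$url = 'http://staging.example.com/';
$concurrency = 100;

$mh = curl_multi_init();
$handles = array();
for ($i = 0; $i < $concurrency; $i++) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// Drive all the handles to completion.
do {
    curl_multi_exec($mh, $running);
    if (curl_multi_select($mh) === -1) {
        usleep(100000); // select failed; back off briefly
    }
} while ($running > 0);

$failures = 0;
foreach ($handles as $ch) {
    if (curl_getinfo($ch, CURLINFO_HTTP_CODE) !== 200) {
        $failures++;
    }
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);
echo "$failures of $concurrency requests failed\n";

For anything serious use a proper tool (ab, siege, JMeter and friends), but 
even this will surface connection and concurrency behaviour you'll never see 
with a single client.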

There will always be situations that you don't foresee, and conditions that are 
difficult to test, but saying that you "can't possibly test" for them is simply 
wrong.

>> For a commercial, zero-hassle solution I can't recommend
>> http://newrelic.com/ highly enough. Simple installation followed by highly
>> detailed reports with zero issues (so far). They do a free trial of all the
>> pro features so you can see if it gets you what you need. And no, I don't
>> work for them, I just think they've built a freakin' awesome product that's
>> invaluable when diagnosing issues that only occur in production. I've never
>> used it on a site with that level of traffic, and I'm sure it won't be a
>> problem, but you may want to only deploy it to a fraction of your
>> infrastructure.
> 
> A quick look at that product seems interesting, but not what I really need.
> We have a ton of monitoring solutions in place to get metrics and
> performance data. I just need a good 'hook' to get details when errors
> occur.

You obviously didn't look closely enough. NewRelic is a PHP extension which 
hooks into errors (and many other things) and provides detailed information 
about everything that happens while your application is running. Do yourself a 
favour and try it.

As an example, I recently diagnosed a snowballing performance problem in a 
ColdFusion application by simply installing NewRelic and waiting for the next 
time the server came crashing down. Without the insights that tool gave me it 
would have taken many times longer to identify and fix the root cause.
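
To give you an idea of the hooks on offer (this is from memory, so check the 
current agent docs), the extension exposes API functions you can call from 
your own error handling:

<?php
// Sketch: forwarding a caught exception to the New Relic agent from your
// own handling code. do_something_risky() is a hypothetical application
// call; the extension_loaded() guard keeps this safe on boxes where the
// agent isn't installed.
try {
    $result = do_something_risky();
} catch (Exception $e) {
    if (extension_loaded('newrelic')) {
        newrelic_notice_error($e->getMessage(), $e);
    }
    error_log('do_something_risky failed: ' . $e->getMessage());
    $result = null; // degrade gracefully instead of dying
}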

>> If you want a homemade solution, the uncaught exceptions are easily dealt
>> with... CATCH THEM, do something useful with them, and then die gracefully.
>> Rocket science this ain't!
> 
> Thanks captain obvious. :)

If it was obvious why did you feel the need to ask the question?

> I can do that (and did do that), but again, at these scales, all the
> text-book code you think you know starts to go out the window. Frameworks
> break down. RDBMS topple over. You have to write things creatively, leanly
> (and sometimes err on the side of 'assume something is there' rather than
> 'assume the worst' or your code spends too much time checking the edge
> cases). Hit it and quit it! Get in. Get out. I can't put try/catch around
> everything everywhere, it's just not efficient or practical. Even the SQL
> queries we write are 'wide' and we pull in as much logical stuff as we can
> in one DB call, get it into a memcache slab and then pull it out of there
> over and over, rather than surgical queries to get small chunks of data
> which would murder MySQL.

This reeks of architectural problems. I understand that it's a codebase that 
you've inherited and that you're doing your best with it, but what you're 
describing are not features of a well-designed, scalable web application. The 
idea that MySQL is best used to pull large datasets rather than just exactly 
what you need makes my skin crawl. You may want to consider having an offline 
process populate Memcache, or look at your DB schema to see if there's a better 
way to store the data with a view to optimising access to it.
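
To sketch what I mean by an offline process (the server address, key name, 
query and TTL are all invented), run something like this from cron so the web 
tier only ever reads the cache:

<?php
// Cron-driven cache warmer (sketch): run the expensive "wide" query once,
// offline, so web requests only ever read from Memcache.
$mc = new Memcached();
$mc->addServer('127.0.0.1', 11211);

$db = new PDO('mysql:host=127.0.0.1;dbname=app', 'user', 'pass');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);

// The expensive query runs on a schedule, never per web request.
$rows = $db->query('SELECT id, title, body FROM articles WHERE live = 1')
           ->fetchAll(PDO::FETCH_ASSOC);

// Use a TTL comfortably longer than the cron interval so the web tier
// never sees an empty cache between runs.
$mc->set('articles:live', $rows, 600);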

Oh, and assumptions are the mother of all screw-ups. Your logs would be far 
more useful if the code caught and properly dealt with problems as they 
occurred. Yes, there will be a slight (and I mean very slight) performance hit 
for catching exceptions and checking for error conditions, but do you really 
believe that most of the large, complex applications out there are not doing 
these things? Solid code is far more important than fast code. Servers are 
cheap, your time is not. Do the maths!
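
Here's the shape I mean, with an invented schema: catch where you have useful 
context, log that context, and degrade rather than die:

<?php
// Sketch: catch where the context lives, log it, and degrade instead of
// dying. The schema and cache key are invented; $mc is a connected
// Memcached instance and $db a PDO instance in exception mode.
function get_products(Memcached $mc, PDO $db)
{
    try {
        $rows = $db->query('SELECT id, name, price FROM products')
                   ->fetchAll(PDO::FETCH_ASSOC);
        $mc->set('products:all', $rows, 300);
        return $rows;
    } catch (PDOException $e) {
        // Log context, not just the message: which host, which query.
        error_log(sprintf('[products] DB read failed on %s: %s',
                          gethostname(), $e->getMessage()));
        // Serve stale cache data rather than erroring the whole page.
        $stale = $mc->get('products:all');
        return ($stale !== false) ? $stale : array();
    }
}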

> Part of the reason I took this job is exactly because of these challenges
> and I've learned an incredible amount here (I've also had to wash the guilt
> off of me some nights, as some code I've written goes against everything I
> was taught and thought I knew for the past decade -- but it works and works
> well -- it just FEELS wrong). We do a lot of things that would make my
> college professors cringe. THAT is the difference between the REAL world and
> the LAB. ;-)

Granted, in the "real world" you cut corners, but many of these shortcuts are 
now accepted techniques and should no longer feel wrong. For example, fully 
normalised databases are painful in a web context; de-normalising the schema 
and duplicating data to optimise for access makes anyone who prefers the 
"right" way feel dirty, but sometimes it's necessary.

Not doing things the academic way should not make you feel dirty. If it does, 
look closely at exactly what it is you're doing that makes you feel that way; 
if it's backed up by valid reasons, the feeling is misplaced.

>> See the set_exception_handler function for an
>> easy way to set up a global function to catch uncaught exceptions if you
>> don't have a limited number of entry points.
>> 
>> You can similarly catch the warnings using the set_error_handler function,
>> tho be aware that this won't be triggered for fatal errors.
> 
> And this is the meat of the solution. Thanks! I'll look into these handlers
> and see if I can inject it into someplace useful. I have high hopes for this
> now.

I still maintain that using NewRelic would be far more efficient and 
controllable, but you can certainly roll your own solution. You may also want 
to check out tools like Scribe and Splunk to assist with managing and examining 
your log files.
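
If you do roll your own, the skeleton is small. A minimal sketch (where the 
messages end up, error_log here, is entirely your choice: syslog, Scribe, 
Splunk, whatever):

<?php
// Minimal global instrumentation skeleton covering all three classes
// of problem the handlers can see.

// 1. Warnings, notices and other non-fatal errors.
set_error_handler(function ($errno, $errstr, $errfile, $errline) {
    error_log("[php $errno] $errstr in $errfile:$errline");
    return false; // let PHP's normal handling run as well
});

// 2. Uncaught exceptions: log, then die gracefully.
set_exception_handler(function ($e) {
    error_log('[uncaught] ' . get_class($e) . ': ' . $e->getMessage()
              . "\n" . $e->getTraceAsString());
    header('HTTP/1.1 500 Internal Server Error');
    echo 'Sorry, something went wrong.';
});

// 3. Fatal errors, which set_error_handler never sees.
register_shutdown_function(function () {
    $err = error_get_last();
    $fatal = array(E_ERROR, E_PARSE, E_CORE_ERROR, E_COMPILE_ERROR);
    if ($err !== null && in_array($err['type'], $fatal)) {
        error_log("[fatal] {$err['message']} in {$err['file']} on line {$err['line']}");
    }
});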

>> But seriously... a minimal level of structured testing would prevent issues
>> like this being deployed to your production servers. Sure, instrument to
>> help resolve these issues now, but if I were you I'd be putting a lot of
>> effort into improving your development process. Contact me off-list if you'd
>> like to talk about this in more detail.
> 
> See above. I have begged for even a single dedicated tester. I have offered
> to sacrifice the open req I had for a junior developer to get a tester. That
> resulted in them taking away the req because "clearly I didn't need the
> developer then" and "we can just test it ourselves". You're preaching to the
> choir my friend. I've been doing this for 15+ years at various companies.
> ;-)


You don't need a dedicated tester, and even if you did have one, that wouldn't 
mean you don't need to test it yourselves. I rarely find myself on the same 
side as an employer [unless it's me :)], but yours is spot on: firstly because 
if you're happy to sacrifice a developer to get a tester then you didn't really 
need another developer, but primarily because you should be testing your work 
yourselves. If you think a dedicated tester would absolve you of the 
responsibility to test your own stuff then you have a lot more to learn.

Tools that enable you to automate a lot of what a dedicated tester would do are 
legion. Unit tests, CI systems, Selenium, and others will set your organisation 
on the way to building solid software that doesn't fill your logs with repeated 
messages arising simply because nobody tested a function against a variety of 
inputs, both valid and invalid.

I hope I haven't come across as too preachy or rude in these two emails, but 
I've heard the arguments you're making many times and they just don't hold 
water for me. You may have been doing this for 15+ years, but have you done it 
at this scale before? Have you done it in a small company that has the proper 
processes and tools in place?

I hope my comments prove useful, and my offer to discuss this off-list stands.

-- 
Stuart Dallas
3ft9 Ltd
http://3ft9.com/