[Tutor] short url processor

2011-05-13 Thread ian douglas

Hey folks,

I'm rewriting a short url processor for my job. I had originally written 
it as a multi-threaded Perl script, which works, but has socket problems 
causing memory leaks. Since I'm rebuilding it to use memcache, and since 
I was learning Python outside of work anyway, I figured I'd rewrite it 
in Python.


I'm using BaseHTTPServer, overriding do_GET and do_POST, and I want to 
set up a custom logging mechanism so I don't have to rewrite our 
separate log parser just yet (I'll eventually rewrite it in Python as 
well).


The problem I'm having, though, is that the BaseHTTPServer setup is 
outputting what appears to be an apache-style log to STDOUT, but the 
logging.debug or logging.info calls I make in the code are also going to 
STDOUT despite my attempt to use logging.basicConfig() overrides and 
setting a filename, etc.


Here are the basics of what I'm doing. Forgive my code; I've already 
been told it's ugly. I'm new to Python and come from a Perl/PHP background.



import struct
import string,cgi,time
import psycopg
import logging
import re
import memcache
from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer
from time import strftime,localtime


class clientThread(BaseHTTPRequestHandler):
    def log_my_request(self, method, request, short_url, http_code,
                       long_url, cached, notes):
        logging.debug('%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s',
                      self.client_address[0],
                      time.strftime('%Y-%m-%d %H:%M:%S', localtime()),
                      method,      # get or post
                      request,     # requested entity
                      short_url,   # matching short_url based on entity, if any
                      http_code,   # 200, 301, 302, 404, etc
                      long_url,    # url to redirect to, if there was one
                      cached,      # 'hit', 'miss', 'miss-db', 'error'
                      notes)       # extra notes for the log file only
        return

    def do_GET(self):
        # logic goes here for finding a short url from memcache, then
        # writing the appropriate output data to the socket, then logging
        # happens (getpost, orig_short_url, etc. come from the omitted
        # lookup logic):
        self.log_my_request(getpost, orig_short_url, short_url, '302',
                            long_url, 'hit', '')
        return

def main():
    # mc (memcache client) and cur (psycopg cursor) are created in setup
    # code omitted from this email
    if mc.get('dbcheck'):  # memcache already has some data
        print('memcache already primed with data')
    else:  # nothing in memcache, so load it up from the database
        print('Connecting to PG')
        cur.execute("SELECT count(*) FROM short_urls")
        mycount = cur.fetchone()
        print('fetching %s entries' % mycount)
        cur.execute("SELECT short_url,long_url FROM short_urls")
        giant_list = cur.fetchall()

        # cache a marker that tells us we've already initialized memcache
        # with db data
        mc.set('dbcheck', 'databasetest', 0)

        # I'm sure there's a MUCH more efficient way of doing this ...
        # multi-set of some sort?
        for i in giant_list:
            if i[0]:
                if i[1]:
                    mc.set(i[0], i[1])

        print('finished retrieving %s entries plus set up a new '
              'dictionary with all values' % mycount)

    # set up the socket, bind to the port, and wait for incoming connections
    try:
        server = HTTPServer(('', 8083), clientThread)
        print 'short url processing has begun'

        # this is where I try to tell Python that I only want my message
        # in my log: no INFO:username prefix, etc., and also to write it
        # to a file
        logging.basicConfig(level=logging.DEBUG)
        logging.basicConfig(format='%(message)s', filename='/tmp/ian.txt')

        server.serve_forever()
    except KeyboardInterrupt:
        print '^C received, shutting down server'
        server.socket.close()


My code runs without any errors, though I have left some code out of 
this email that I didn't feel was relevant, such as the logic for 
checking whether a short url exists in memcache, trying to fetch from 
the db if there was no match, and force-deleting short urls from 
memcache based on other instructions if the db lookup also fails; that 
sort of thing. None of it deals with logging or the BaseHTTPServer code.


To recap, the code runs, redirects are working, but ALL output goes to 
STDOUT. I can understand that print statements would go to STDOUT, but 
the BaseHTTPServer seems to want to write the Apache-style log to 
STDOUT, and my logging.info() call also prints to STDOUT instead of my file.
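
A minimal sketch of what is probably happening with the logging half, 
assuming stock Python 2.x behavior: logging.basicConfig() only 
configures the root logger if it has no handlers yet, so the second 
call above (the one carrying format= and filename=) is silently 
ignored, and the console handler installed by the first call keeps 
winning:

import logging

# First call: attaches a console StreamHandler to the root logger.
logging.basicConfig(level=logging.DEBUG)

# Second call: silently ignored, because the root logger already has a
# handler -- format= and filename= never take effect.
logging.basicConfig(format='%(message)s', filename='/tmp/ian.txt')

logging.debug('this lands on the console, not in /tmp/ian.txt')

# A single call with every setting, made before the first log record is
# emitted, behaves as intended:
#   logging.basicConfig(level=logging.DEBUG, format='%(message)s',
#                       filename='/tmp/ian.txt')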


I'd love to hear any thoughts from people that have had to deal with 
this. The logging is the last piece of the puzzle for me.


Thanks,
Ian

Re: [Tutor] short url processor

2011-05-13 Thread Alan Gauld


ian douglas ian.doug...@iandouglas.com wrote:


outputting what appears to be an apache-style log to STDOUT, but the
logging.debug or logging.info calls I make in the code are also going to
STDOUT despite my attempt to use logging.basicConfig() overrides and
setting a filename, etc.


I don't know anything about BaseHTTPServer and not much
about the logging module, however some thoughts are...

How do you know they are going to stdout? Are you sure
they aren't going to stderr, and that stderr isn't mapped to stdout
(usually the default)? Have you tried redirecting stderr to a
file, for example?

As I say, just some thoughts,

Alan G. 





Re: [Tutor] short url processor

2011-05-13 Thread ian douglas

On 05/13/2011 05:03 PM, Alan Gauld wrote:

How do you know they are going to stdout? Are you sure
they aren't going to stderr, and that stderr isn't mapped to stdout
(usually the default)? Have you tried redirecting stderr to a
file, for example?

As I say, just some thoughts,



Thanks for your thoughts, Alan. I had done some testing with 
command-line redirects and forget which it was; I think my debug log 
was going to stdout and the apache-style log was going to stderr, or 
the other way around...
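
For what it's worth, both defaults point the same way, which makes this 
easy to misremember: BaseHTTPRequestHandler.log_message() in the Python 
2 stdlib writes its apache-style line with sys.stderr.write(), and the 
StreamHandler that logging.basicConfig() installs also defaults to 
stderr; only plain print statements use stdout. A quick way to confirm, 
assuming a Unix-style shell and a hypothetical script name shorturl.py:

import sys

# run as:  python shorturl.py 1>out.log 2>err.log
# print statements land in out.log; the BaseHTTPServer request log and
# logging's default StreamHandler output land in err.log.
print 'hello via stdout'
sys.stderr.write('hello via stderr\n')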


After a handful of guys in the #python IRC channel very nearly convinced 
me that all but three stdlib libraries are utterly worthless, and told 
me to download and use third-party frameworks just to fix a simple 
logging issue, I overrode log_request() and log_message() like so:


class clientThread(BaseHTTPRequestHandler):

    def log_request(self, code='-', size='-'):
        return

    def log_message(self, format, *args):
        open(LOGFILE, 'a').write('%s\n' % (format % args))


... and now the only logging going on is my own, and it's logged to my 
external file. Overriding log_request means that BaseHTTPServer no 
longer outputs its apache-style log, and overriding log_message means my 
other logging.info() and logging.debug() messages go out to my file as 
expected.
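
One side note on the log_message() override above: it reopens LOGFILE on 
every message. A sketch of an equivalent setup that keeps the file open, 
using only the stdlib logging module (the logger name is arbitrary, and 
the path assumes the /tmp/ian.txt file from the original post):

import logging

logger = logging.getLogger('shorturl')         # arbitrary logger name
handler = logging.FileHandler('/tmp/ian.txt')  # file stays open
handler.setFormatter(logging.Formatter('%(message)s'))  # bare message, no INFO: prefix
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)

logger.debug('one tab-separated log line')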


-id



Re: [Tutor] short url processor

2011-05-13 Thread ian douglas

On 05/13/2011 05:03 PM, Alan Gauld wrote:

As I say, just some thoughts,



I *am* curious, Alan, whether you or anyone else on the list can help 
me make this a little more efficient:


cur.execute("SELECT short_url,long_url FROM short_urls")
giant_list = cur.fetchall()

for i in giant_list:
    if i[0]:
        if i[1]:
            mc.set(i[0], i[1])


At present, we have about two million short URLs in our database, and 
I'm guessing there's a much smoother way of iterating through 2M+ rows 
from a database and cramming them into memcache. I imagine there's a 
map function in there that could be much more efficient?
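
On the multi-set idea from the original post: the python-memcached 
client exposes set_multi(), which batches many keys into far fewer 
network round trips than calling set() two million times. A sketch, 
assuming that client and Python 2 (the 1000-key chunk size is an 
arbitrary choice):

CHUNK = 1000  # keys per set_multi() call; tune to taste

# drop rows with an empty short_url or long_url, as the original loop does
pairs = dict((s, l) for (s, l) in giant_list if s and l)
items = pairs.items()
for i in xrange(0, len(items), CHUNK):
    mc.set_multi(dict(items[i:i + CHUNK]))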


v2 of our project will join our short_urls table with its 'stats' table 
counterpart, so that I only fetch the top 10,000 URLs (or some other 
smaller quantity). Until we get to that point, I need to speed up the 
restart time whenever this script has to be restarted. This is partly 
why v1.5 was to put the database entries into memcache, so we wouldn't 
need to reload the db into memory on every restart.


Thanks,
Ian



Re: [Tutor] short url processor

2011-05-13 Thread Nick Raptis


On 05/14/2011 03:49 AM, ian douglas wrote:

for i in giant_list:
    if i[0]:
        if i[1]:
            mc.set(i[0], i[1])


Until Alan comes back with a more rounded answer, I'd suggest something 
along the lines of


[mc.set(x, y) for (x, y) in giant_list if x and y]

I'm writing this from memory, but check list comprehensions in the 
documentation.
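
One caveat on that one-liner: used purely for side effects, it also 
builds a throwaway list of mc.set()'s return values. A plain loop does 
the same filtering without that, and arguably reads better:

for short_url, long_url in giant_list:
    if short_url and long_url:
        mc.set(short_url, long_url)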


Anyway, there are map, reduce, and similar functions in Python, though 
in Python 3.x reduce has to be imported from functools.


Now, the real question would be: can you use the cursor as an iterator 
(without hitting the database for each new record)?

Then you could skip the worst part, loading all the values into giant_list.
Just an idea for Alan and the others to answer.
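
A sketch of that idea, assuming psycopg2 (the original post imports the 
older psycopg, which may differ) and an open connection conn: a named 
(server-side) cursor keeps the result set on the server and streams it 
in batches, so giant_list never has to exist ('url_loader' is an 
arbitrary cursor name):

cur = conn.cursor('url_loader')   # named cursor => server-side result set
cur.itersize = 10000              # rows fetched per round trip (default 2000)
cur.execute("SELECT short_url, long_url FROM short_urls")
for short_url, long_url in cur:
    if short_url and long_url:
        mc.set(short_url, long_url)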

Nick