On Wednesday, August 14, 2013 4:46:09 PM UTC+3, mar...@python.net wrote:
> On Wed, Aug 14, 2013, at 09:18 AM, Guy Tamir wrote:
> > Hi all,
> >
> > I have a Ubuntu server running NGINX that logs data for me.
> > I want to write a python script that reads my customized logs and after
> > a little rearrangement save the new data into my DB (postgresql).
> >
> > The process should run about every 5 minutes and i'm expecting large
> > chunks of data on several 5 minute windows..
> >
> > My plan for achieving this is to install python on the server, write a
> > script and add it to cron.
> >
> > My question is what the simplest way to do this?
> > should i use any python frameworks?
>
> Rarely do I put "framework" and "simplest way" in the same set.
>
> I would do 1 of 2 things:
>
> * Write a simple script that reads lines from stdin, and writes to the
> db. Make sure it gets run in init before nginx does and tail -F -n 0 to
> that script. Don't worry about the 5-minute cron.
>
> * Similar to above but if you want to use cron also store in the db the
> offset of the last byte read in the file, then when the cron job kicks
> off again seek to that position + 1 and begin reading, at EOF write the
> offset again.
>
> This is irrespective of any log rotating that is going on behind the
> scenes, of course.
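
The second option quoted above (cron plus a stored offset) might look roughly like the sketch below. It is only an illustration: the function name `read_new_lines` is made up, the caller is assumed to load and save the offset itself (in the DB, as suggested), and the stored value here is the position of the next unread byte, so no "+ 1" is needed when seeking. A file that has shrunk is assumed to have been rotated or truncated and is read from the start.

```python
import os


def read_new_lines(log_path, offset):
    """Return (complete_lines, new_offset) for data appended since `offset`.

    `offset` is the byte position saved by the previous run (0 on first run).
    """
    size = os.path.getsize(log_path)
    if size < offset:
        # File is smaller than last time: assume rotation or truncation.
        offset = 0

    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()

    # Consume only up to the last newline, so a half-written final line
    # is left in place and picked up whole on the next run.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end
```

Each cron run would call `read_new_lines(path, saved_offset)`, process the returned lines, and then store the new offset, ideally in the same transaction as the data so the two cannot drift apart.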
I'm not sure I understood the first option, or what it means to run the script before nginx. The second option sounds more like what I had in mind. Are there any existing components that do this which I could use?

Since the log fills up quickly, I'm having trouble reading that much data and writing it all to the DB in a reasonable amount of time. The table receiving the new data is somewhat complex. Its purpose is to store data about ads shown from my app, and the fields are:

(ad_id, user_source_site, user_location, day_date, specific_hour, views, clicks)

Each row is distinct by the first 5 fields, since I need to show different types of stats. Because each new log line may or may not already have a row in the DB, I have to run an upsert (update or insert) for each row. This leads to very poor performance.

Do you have any ideas about how I can make this script more efficient?

--
http://mail.python.org/mailman/listinfo/python-list
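
One possible way to cut the per-row upsert cost, sketched under several assumptions: sum the views and clicks in memory for each chunk of log lines first, keyed by the five distinguishing fields, so the database sees one upsert per distinct key rather than one per log line. The table name `ad_stats`, the event tuple layout, and the "view"/"click" labels are hypothetical, since the log format is not given. The SQL uses `INSERT ... ON CONFLICT`, which needs PostgreSQL 9.5 or later and a unique constraint on the five key columns; the `%s` placeholders assume a driver such as psycopg2.

```python
from collections import defaultdict

# Hypothetical table name; assumes a UNIQUE constraint on the five key columns.
UPSERT_SQL = """
INSERT INTO ad_stats
    (ad_id, user_source_site, user_location, day_date, specific_hour,
     views, clicks)
VALUES (%s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (ad_id, user_source_site, user_location, day_date, specific_hour)
DO UPDATE SET views = ad_stats.views + EXCLUDED.views,
              clicks = ad_stats.clicks + EXCLUDED.clicks
"""

# Position of each counter inside the [views, clicks] pair.
COUNTER_INDEX = {"view": 0, "click": 1}


def aggregate(events):
    """Collapse parsed log events into one row per distinct 5-field key.

    Each event is assumed to be a tuple:
    (ad_id, site, location, day, hour, kind), with kind "view" or "click".
    """
    totals = defaultdict(lambda: [0, 0])
    for ad_id, site, location, day, hour, kind in events:
        index = COUNTER_INDEX.get(kind)
        if index is None:
            continue  # ignore event kinds this sketch does not know about
        totals[(ad_id, site, location, day, hour)][index] += 1
    return [key + tuple(counts) for key, counts in totals.items()]


def flush(cursor, events):
    """Upsert the aggregated rows through any DB-API cursor."""
    rows = aggregate(events)
    cursor.executemany(UPSERT_SQL, rows)
    return len(rows)
```

With this shape, a 5-minute window containing many lines for the same ad, site, location, and hour results in a single statement for that key. If that is still too slow, loading the aggregated rows into a temporary table and merging from there in one statement may be worth trying.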