[sqlalchemy] Re: querying into objects from a large flat table or have multiple processed tables mapped to objects?
I wish I could make pathetic whining noises and puppy-dog eyes... If anyone
with some relational nous could give me even some vague pointers, that would
be great...

Chris

Chris Withers wrote:
> Hi All,
>
> I'm looking to do (something like) weblog analysis and thought it'd be
> interesting to try a relational database through SQLAlchemy. Apologies in
> advance: my knowledge of relational databases and SA is much poorer than
> I'd like, so even simple pointers may help...
>
> So, assuming I have data something like:
>
>   2009-05-20-00:01:02.345 user1 192.168.1.1  /some/folder/some/path
>   2009-05-20-00:01:03.452 user1 192.168.1.1  /some/folder/other/path
>   2009-05-20-00:01:02.345 user2 192.168.1.10 /some/folder/some/path
>
> ...what would people recommend for storage? One table mapped to a Row
> class or some such? What would the primary key be? (Since two entries
> could theoretically happen at exactly the same time, I can't use the
> date...)
>
> I'd then like to aggregate first by user, then by month, giving a UI like:
>
> Users
> =====
> Username   Total Pages   Last Viewed
> --------   -----------   -----------
> user1      32434         27th July 2009 16:05
> user2      15000         1st Jan 2009 01:05
> ...lots more users, probably batched...
>
> Clicking on a username gives:
>
> User1
> =====
> Month        Total Pages   Last Viewed
> -----        -----------   -----------
> May 2009     1000          15th May 2009 13:50
> April 2009   1000          30th April 2009 23:55
> ...lots of months...
>
> Clicking on a month gives:
>
> User1 - May 2009
> ================
> Page                      Hits   Last Viewed
> ----                      ----   -----------
> /some/folder/some/path    20     15th May 2009 13:50
> /some/folder/other/path   33     1st May 2009 13:50
>
> What queries would I use to generate these views? I'm envisaging some
> GROUP BYs here, but I don't know how I'd plug those into SA. Would I
> have classes for users, months and pages with selects mapped to them,
> or something else? And how would I batch them, so that if there are
> several thousand rows in a table I can show the classic "1 to 20 of
> 3000" without having to load all 3000 rows?
> Another option I thought of was to not actually store the raw rows, but
> to have several tables:
>
>     class User(Base):
>         __tablename__ = 'users'
>         username = Column(String, primary_key=True)
>         total_pages = Column(Integer)
>         last_viewed = Column(DateTime)
>
>     class Month(Base):
>         __tablename__ = 'months'
>         username = Column(String, ForeignKey('users.username'),
>                           primary_key=True)
>         monthname = Column(String, primary_key=True)
>         total_pages = Column(Integer)
>         last_viewed = Column(DateTime)
>
>     class Page(Base):
>         __tablename__ = 'pages'
>         username = Column(String, primary_key=True)
>         monthname = Column(String, primary_key=True)
>         path = Column(String, primary_key=True)
>         hits = Column(Integer)
>         last_viewed = Column(DateTime)
>         # months has a composite primary key, so the reference
>         # needs a composite foreign key:
>         __table_args__ = (
>             ForeignKeyConstraint(
>                 ['username', 'monthname'],
>                 ['months.username', 'months.monthname']),
>         )
>
> ...and then populate them while parsing the logs. I can see this storing
> a lot less data, but would it be faster? And still, how would I do
> batching of results?
>
> Ideally, I'd prefer not to go down this route, as it limits different
> types of analysis later down the line, but...
>
> What do people think? What's the normal/correct way of handling large
> amounts of data to be sliced and diced like this?
>
> cheers,
>
> Chris
>
> --
> Simplistix - Content Management, Zope Python Consulting
>            - http://www.simplistix.co.uk

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups
"sqlalchemy" group.
To post to this group, send email to sqlalchemy@googlegroups.com
To unsubscribe from this group, send email to
sqlalchemy+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/sqlalchemy?hl=en
-~--~~~~--~~--~--~---
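The "populate them while parsing the logs" idea above can be sketched without any database at all: keep running counters per user and per (user, month) while reading lines, then write the totals out in one pass. This is a minimal sketch; the log format is taken from the sample lines in the message, but the field names and the `LOG_LINES` data are illustrative, not from the thread.

```python
# Aggregate-on-ingest sketch: one pass over the log, counters in memory.
from collections import defaultdict
from datetime import datetime

LOG_LINES = [
    "2009-05-20-00:01:02.345 user1 192.168.1.1 /some/folder/some/path",
    "2009-05-20-00:01:03.452 user1 192.168.1.1 /some/folder/other/path",
    "2009-05-20-00:01:02.345 user2 192.168.1.10 /some/folder/some/path",
]

users = defaultdict(lambda: {"total_pages": 0, "last_viewed": None})
months = defaultdict(lambda: {"total_pages": 0, "last_viewed": None})

for line in LOG_LINES:
    stamp, user, ip, path = line.split()
    when = datetime.strptime(stamp, "%Y-%m-%d-%H:%M:%S.%f")
    # Bump both aggregation levels: per-user and per-(user, month).
    for counters, key in ((users, user),
                          (months, (user, when.strftime("%Y-%m")))):
        row = counters[key]
        row["total_pages"] += 1
        if row["last_viewed"] is None or when > row["last_viewed"]:
            row["last_viewed"] = when

print(users["user1"]["total_pages"])  # -> 2
```

The resulting dicts map straight onto the `users`/`months` tables proposed above, so the database write becomes a simple bulk insert at the end of the parse. The trade-off is exactly the one the message raises: the raw rows are gone, so any analysis not anticipated here needs a re-parse.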
[sqlalchemy] Re: querying into objects from a large flat table or have multiple processed tables mapped to objects?
The best thing to do would be to experiment with various schemas and see
what works best (and maybe read some SQL books, or some books about web
log data mining, if you feel you want some background). The primary key
of your table would be easiest as a surrogate integer key; GROUP BY is
provided by the group_by() method of select() or Query. You might also
want to investigate the star schema approach, which is popular for this
kind of slicing and dicing: http://en.wikipedia.org/wiki/Star_schema

On May 25, 2009, at 12:08 PM, Chris Withers wrote:
> [snip]

--~--~-~--~~~---~--~~
To post to this group, send email to sqlalchemy@googlegroups.com
To unsubscribe from this group, send email to
sqlalchemy+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/sqlalchemy?hl=en
-~--~~~~--~~--~--~---
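Putting the advice above together, here is a minimal sketch of the flat-table approach: one row per log entry, a surrogate integer primary key, and the "Users" summary view built with group_by() aggregation plus limit()/offset() for batching. This uses modern SQLAlchemy idioms rather than the 2009-era API, and the `Hit` class and its column names are illustrative, not from the thread.

```python
from datetime import datetime

from sqlalchemy import Column, DateTime, Integer, String, create_engine, func
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Hit(Base):
    __tablename__ = 'hits'
    # Surrogate key: two hits can share a timestamp, so the
    # timestamp alone can't be the primary key.
    id = Column(Integer, primary_key=True)
    timestamp = Column(DateTime, nullable=False)
    username = Column(String, nullable=False)
    ip = Column(String)
    path = Column(String, nullable=False)

engine = create_engine('sqlite://')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add_all([
        Hit(timestamp=datetime(2009, 5, 20, 0, 1, 2), username='user1',
            ip='192.168.1.1', path='/some/folder/some/path'),
        Hit(timestamp=datetime(2009, 5, 20, 0, 1, 3), username='user1',
            ip='192.168.1.1', path='/some/folder/other/path'),
        Hit(timestamp=datetime(2009, 5, 20, 0, 1, 2), username='user2',
            ip='192.168.1.10', path='/some/folder/some/path'),
    ])
    session.commit()

    # "Users" view: total pages and last-viewed per user.
    # limit()/offset() do the "1 to 20 of 3000" batching in SQL,
    # so only one page of results is ever loaded.
    page = (
        session.query(Hit.username,
                      func.count(Hit.id).label('total_pages'),
                      func.max(Hit.timestamp).label('last_viewed'))
        .group_by(Hit.username)
        .order_by(Hit.username)
        .limit(20).offset(0)
        .all()
    )
    for username, total_pages, last_viewed in page:
        print(username, total_pages, last_viewed)
```

The per-month and per-page drill-down views are the same shape: add a filter on the username, and group by a month expression or by `Hit.path` instead.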
[sqlalchemy] Re: querying into objects from a large flat table or have multiple processed tables mapped to objects?
[web logs - db]

On Tuesday 26 May 2009 00.27:03 Michael Bayer wrote:
> the best thing to do would be to experiment with some various schemas
> and see what works best

Also, it's extremely important to keep in mind that SQL databases can
only work well with big tables if you create the right indices. What
kind of index to create depends on the database you'll use and on which
queries you'll run. (And: don't use a small example data set to decide
which indices to build. A database will often change its query plan,
sometimes drastically, depending on how much data is in a table.)

So: create your schema, fill it with a few million log entries, and then
look at which queries might need which indices. In the case of web logs,
you'll probably want an index on every base field (timestamp, user, IP,
URL string). But depending on your exact queries, combined indices on
multiple columns, or indices on functions of fields (like, perhaps, an
index on the ending of the URL, to quickly separate jpg/png requests
from html/php requests and from requests on directories ending in '/'),
might speed up your queries dramatically.

cheers
-- vbi

--
Even though I use zsh, holding down the TAB key just doesn't write my
code for me, so I have to use a text editor.
        -- Scott James Remnant
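The single-column and combined indices described above can be declared directly in SQLAlchemy alongside the table definition. A minimal sketch; the table and index names are illustrative, and expression indices (such as one on the URL's ending) are left out because their syntax and availability depend on the backend.

```python
from sqlalchemy import (Column, DateTime, Index, Integer, MetaData, String,
                        Table, create_engine)

metadata = MetaData()

hits = Table(
    'hits', metadata,
    Column('id', Integer, primary_key=True),
    Column('timestamp', DateTime),
    Column('username', String),
    Column('ip', String),
    Column('path', String),
    # Single-column index: filtering hits by user.
    Index('ix_hits_username', 'username'),
    # Combined index: the per-user, per-time-range aggregation queries.
    Index('ix_hits_username_timestamp', 'username', 'timestamp'),
)

engine = create_engine('sqlite://')
metadata.create_all(engine)  # emits CREATE INDEX along with CREATE TABLE
```

As the message says, which of these actually pays off can only be decided against realistic data volumes; EXPLAIN (or the backend's equivalent) on the real queries is the arbiter.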