JS, I need one more piece of information to make that query work for you: the structure of the table you use (or would use) to bulk import those logs into.
If you are running out of room, you may consider further normalizing your data (which saves space, but creates more data-maintenance steps). I, too, have had to deal with millions of rows of internet usage logs, so I understand your pain.

You can store your IP addresses in an INT and get them back in dotted notation with the MySQL functions INET_ATON() and INET_NTOA(). That will save you an average of 10 bytes PER ROW (it adds up when you are into the millions of rows).

Create a Proxy_Action table (id tinyint, action varchar(20)) and populate it with all of the actions your proxy/firewall can log. Then replace the column "action varchar(20)" with "ProxyAction_id tinyint". (I assume there are fewer than 256 "action" messages available from your proxy?) That's 1 byte vs. 8 at the low end of your sample data (counting the null at the end of the string). First add the new column to the table, populate it with the ID values from your new Proxy_Action table, then drop the old column.

URLs consist (in a basic sense) of the server portion (to the left of the first single /), the path portion (right of the first / and left of a ? or #), and either a fragment (after the #) or a query string (after the ?). I would at least split the server portion out into its own table. For each page request (assume 1 page and 9 pictures, all from the same server) that would be 10 rows of log data that all contain the same chunk of similar information. Reducing that heavily repeated portion of your data to an ID number will greatly help reduce the size of your database.

About the non-uniqueness of your internet_usage table: even if the same user visits the same URL multiple times (is that what you mean by repeated records?), the times should all be slightly different. If they are not different, it is still possible that the same person requested the same page twice or more during the same second (the auto-login feature of MSIE comes to mind as one culprit).
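To make the action-table migration concrete, here is a minimal sketch of the steps described above, using the table and column names from your DESC output. This assumes MySQL 4.0.4 or later for the multi-table UPDATE; the IP conversion at the end is the same idea applied to the ip column.

```sql
-- Lookup table for the proxy's action strings (assumed to be < 256 distinct values).
CREATE TABLE Proxy_Action (
    id     TINYINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    action VARCHAR(20) NOT NULL,
    UNIQUE (action)
);

-- Populate it from the existing data; IGNORE skips duplicates on re-runs.
INSERT IGNORE INTO Proxy_Action (action)
    SELECT DISTINCT action FROM internet_usage;

-- Add the new 1-byte column, fill it in, then drop the old 20-byte one.
ALTER TABLE internet_usage ADD COLUMN ProxyAction_id TINYINT UNSIGNED;

UPDATE internet_usage iu, Proxy_Action pa
    SET iu.ProxyAction_id = pa.id
    WHERE iu.action = pa.action;

ALTER TABLE internet_usage DROP COLUMN action;

-- Same pattern for the IP column:
-- ALTER TABLE internet_usage ADD COLUMN ip_num INT UNSIGNED;
-- UPDATE internet_usage SET ip_num = INET_ATON(ip);
-- ...and SELECT INET_NTOA(ip_num) gives you the dotted form back.
```

Run it on a copy (or a subset) first, of course — the column names here are taken from your DESC output, not verified against your live schema.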
OR you could have multiple users all on the same userID hitting the same page from different machines... I guess I would have to see more data to understand your problem better.

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

From: "J S" <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
cc: [EMAIL PROTECTED]
Date: 06/18/2004 09:40 AM
Subject: Re: load data into 2 tables and set id

Shawn,

Thanks for helping on this. I really appreciate it.

>No problem!!
>
>Please post the structures of your "big_table" and your "url_table"
>(whatever you called them) and I will help you to rewrite step 4 to count
>how many times a URL appears in the "big_table".

mysql> desc internet_usage;
+--------+-------------+------+-----+---------+-------+
| Field  | Type        | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| uid    | varchar(10) | YES  | MUL | NULL    |       |
| time   | datetime    | YES  |     | NULL    |       |
| ip     | varchar(20) | YES  |     | NULL    |       |
| action | varchar(20) | YES  |     | NULL    |       |
| urlid  | int(11)     | YES  |     | NULL    |       |
| size   | int(11)     | YES  |     | NULL    |       |
+--------+-------------+------+-----+---------+-------+
6 rows in set (0.03 sec)

mysql> desc url_table;
+----------+--------------+------+-----+---------+----------------+
| Field    | Type         | Null | Key | Default | Extra          |
+----------+--------------+------+-----+---------+----------------+
| urlid    | mediumint(9) |      | PRI | NULL    | auto_increment |
| url      | text         | YES  | MUL | NULL    |                |
| hits     | mediumint(9) | YES  |     | NULL    |                |
| category | varchar(50)  | YES  |     | NULL    |                |
+----------+--------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)

>Is this bulk import a process you need to do repeatedly? If so, we need to
>worry about updating the count column on the next batch import and not just
>re-creating it from the new data. That will change the query significantly.

Yes, it will be a repeated process. Actually I have a backlog of 6 months' data to load!
Here's an example of what the data looks like:

mysql> select * from internet_usage limit 5;
+---------+---------------------+----------------+--------------+-------+------+
| uid     | time                | ip             | action       | urlid | size |
+---------+---------------------+----------------+--------------+-------+------+
| n58396  | 2004-06-07 21:12:16 | 21.38.25.204   | TCP_TUNNELED |  5999 | 5297 |
| u344584 | 2004-06-07 21:07:12 | 21.33.136.74   | TCP_HIT      |  4494 |  438 |
| -       | 2004-06-07 21:07:02 | 21.38.92.76    | TCP_NC_MISS  |  2134 |  771 |
| u524797 | 2004-06-07 21:03:27 | 21.32.25.41    | TCP_NC_MISS  |   260 |  582 |
| -       | 2004-06-07 21:09:13 | 21.201.130.240 | TCP_HIT      |  3112 |  542 |
+---------+---------------------+----------------+--------------+-------+------+

mysql> select * from url_table limit 5;
+-------+-----------------------------------------------------------------------+------+---------------+
| urlid | url                                                                   | hits | category      |
+-------+-----------------------------------------------------------------------+------+---------------+
|     1 | http://www.bbc.co.uk/horoscopes/chinesehoroscopes/images/hp-snake.gif | NULL | Entertainment |
|     2 | http://www.call18866.co.uk/images/logo.jpg                            | NULL | none          |
|     3 | http://m2.doubleclick.net/866421/0409_santoku_250.gif                 | NULL | none          |
|     4 | http://lysto1-dc02.ww.ad.ba.com/                                      | NULL | Travel        |
|     5 | http://www.aboutscotland.com/edin/newstreetph/sitview.jpg             | NULL | Travel        |
+-------+-----------------------------------------------------------------------+------+---------------+
5 rows in set (0.00 sec)

One other problem I'm having here is making the rows in internet_usage unique. At the moment I have lots of duplicates, and I was trying to create a temporary table but unfortunately got an error 27 (I think this refers to a 2GB limit).

mysql> CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM internet_usage;
ERROR 1030: Got error 27 from table handler

Is there another way of doing this?
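One route that avoids building a second full copy of the table (which is what hit the 2GB file limit above) is ALTER IGNORE with a UNIQUE index spanning every column. This is a sketch only, for the MySQL 4.x/MyISAM setup implied by the thread — ALTER IGNORE silently discards rows that would violate the new index, so back up or test on a subset first. The column list is assumed from the DESC output above.

```sql
-- Drop duplicate rows in place: IGNORE makes the index build keep the
-- first of each duplicate group and discard the rest, rather than fail.
ALTER IGNORE TABLE internet_usage
    ADD UNIQUE INDEX dedup (uid, time, ip, action, urlid, size);

-- If the index was only needed for the one-off cleanup, drop it after.
ALTER TABLE internet_usage DROP INDEX dedup;
```

Note this treats two rows as duplicates only when *every* column matches; whether that is the right definition depends on the multiple-users-per-userID question raised earlier in the thread.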
> >Yours,
> >Shawn Green
> >Database Administrator
> >Unimin Corporation - Spruce Pine
> >
> >Mos forgot to populate the url_id column in your user table. I would use
> >his same process but re-arrange it like this:
> >
> >1) create table BIG_TABLE....
> >2) load data infile....
> >3) create table URL_TABLE (
> >     url_id bigint not null auto_increment,
> >     url varchar(25) not null primary key,
> >     category ....
> >   )
> >4) INSERT IGNORE URL_TABLE (url, category)
> >     SELECT url, category
> >     FROM BIG_TABLE
> >5) create table USER_TABLE (
> >     user_id varchar?(...) primary key,
> >     date_time datetime,
> >     url_id bigint,
> >     size int
> >   )
> >6) insert USER_TABLE
> >     SELECT bt.userID, bt.datetime, u.url_id, bt.size
> >     FROM BIG_TABLE bt
> >     INNER JOIN URL_TABLE u
> >       ON u.url = bt.url
> >
> >Doing it this way lets the SQL engine handle the conversion of URLs to
> >their IDs in the USER_TABLE... see? No scripting required at all!
> >
> >Yours,
> >Shawn Green
> >Database Administrator
> >Unimin Corporation - Spruce Pine
>
>Shawn,
>
>Many thanks for your reply. It sounds like this might be quicker than my
>perl script, which parses a 1GB log file and fills the database in 40 mins
>at the mo. (The actual parsing only takes 1.5 mins.)
>There's one snag which I forgot about: in the url_table I have another
>column called hits, which is the number of hits for each url. I'm terrible
>at SQL and wondered if you might be able to suggest a way of doing this
>with the method above?
>
>Thanks,
>
>JS.
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
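On the hits-column snag in the quoted question: one way to fit it into the method above is to recount from internet_usage after each batch import, so repeated loads stay correct without incrementing anything. A sketch, assuming the table names from the DESC output and MySQL 4.0.4+ (multi-table UPDATE); unlike the DISTINCT copy that hit error 27, the temporary count table here holds only one row per distinct urlid, so it stays small.

```sql
-- Recompute per-URL hit counts after a batch import.
CREATE TEMPORARY TABLE url_counts
    SELECT urlid, COUNT(*) AS cnt
    FROM internet_usage
    GROUP BY urlid;

-- Multi-table UPDATE copies the fresh counts into url_table.
UPDATE url_table u, url_counts c
    SET u.hits = c.cnt
    WHERE u.urlid = c.urlid;

DROP TABLE url_counts;
```

Because the counts are rebuilt from scratch each time, this also handles the repeated-import concern raised earlier: there is no running total to get out of sync.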