-------- welcome to a basic overview of bulk importing and normalizing as you go ------------
[ author's note: if you are seeing this thread for the first time and certain items seem to be introduced out of context, please review all previous posts in this thread. There has been a lot of information already exchanged on this topic that I clipped out of this response. Thanks! -- SG] In an earlier post you said >The data values I have for each record are: > >user_id date_time size url category >for example: > >u752359 2004-04-02 12:33:04 3403 http://www.mysql.com/index.html business > To me that implies that you have a standard text log where each field is separated by a space and rows are delimited by a CRLF pair. To get that log into MySQL you will need to do something _like_ (I say "like" as you will most likely need to tweak it to accommodate your actual data) CREATE TABLE bulk_table ( user varchar(10) , eventdate date , eventtime time , size int , url text , category varchar(50) ); LOAD DATA INFILE <a proxy log file> INTO bulk_table FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\r\n'; Now we can start merging the logs into your data tables. To quote RFC 2396: The URI syntax does not require that the scheme-specific-part have any general structure or set of semantics which is common among all URI. However, a subset of URI do share a common syntax for representing hierarchical relationships within the namespace. This "generic URI" syntax consists of a sequence of four main components: <scheme>://<authority><path>?<query> each of which, except <scheme>, may be absent from a particular URI. That translates into 4 logical pieces we can split a URL into: 1) the scheme -- HTTP, FTP, GOPHER, etc... 2) the authority -- that is the server (www.yahoo.com) and any port numbers or login information 3) the path -- /somefolder/somefile.whatever 4) the query -- everything after the ? Not part of the generic URI is that bit known as a fragment (as identified by the #) it is ALWAYS at the end of the _entire URL_ (including the query) when it's used. I have always lumped those and queries into the same field. So you could create tables for each of those parts and get VERY normalized or you can partly normalize like this: ALTER TABLE bulk_table add server_split int not null default 0 , add path_split int not null default 0 , add server varchar(255) , add path varchar(255) UPDATE bulk_table set server_split = LOCATE('/', URL , 8)-1; UPDATE bulk_table SET path_split = if(LOCATE('?', URL, server_split)> 0, LOCATE('?', URL, server_split), LOCATE('#', URL, server_split)-1); UPDATE bulk_table set server=LEFT(URL, server_split ) #those 4 new columns helped us to parse out the 3 major parts of the url #I added them to the table so that we would not have to keep recalculating those values later on # if it turns out that adding the columns takes a LONG time, we can create this table with those columns # already created and just not import to them (change the LOAD DATA INFILE statement slightly) CREATE TABLE IF NOT EXISTS url_servers ( ID int not null auto_increment, server varchar(255) primary key, Key (ID) ) CREATE TABLE IF NOT EXISTS url_paths ( ID int not null auto_increment, path varchar(255) primary key, Key (ID) ) INSERT IGNORE INTO url_servers(server) SELECT DISTINCT server FROM bulk_table INSERT IGNORE INTO url_paths (path) SELECT DISTINCT path FROM bulk_table # at this point we have all of our new Servers and our Paths uniquely numbered # but we are going to need a slightly different URL table to track visits. CREATE TABLE url_visit ( urlid mediumint not null auto_increment primary key, url_server_ID int not null default 0, url_path_ID int not null default 0, querystring text default null, category varchar(50)default null, KEY(url_server_ID, url_path_ID) ) ## that last key is to speed up our joins to our _servers and _paths tables... # we finally have enough information to insert to the visit table INSERT url_visit (url_server_ID, url_path_ID, querystring, category) SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path), NULL), bt.category FROM bulk_table bt INNER JOIN url_servers us ON us.server = bt.server INNER JOIN url_paths up on up.path=bt.path ## (see where the new pre-computed columns come in handy?) :-D # and now we have enough information to load the internet_usage table. Though there isn't enough data in your sample # source data to fill in all of the columns INSERT internet_usage ( uid,`time`,urlid, size) SELECT bt.user, ADDTIME(bt.date, bt.time), uv.urlid, bt.size FROM bulk_table bt INNER JOIN url_servers us ON us.server = bt.server INNER JOIN url_paths up ON up.path=bt.path INNER JOIN url_visit uv ON uv.url_server_ID = us.ID AND uv.url_path_ID = up.id AND uv.querystring = if(bt.path_split >0, SUBSTRING(bt.url,path), NULL) It may not be perfect but it's how I would do it. I am sure that others out there would have a twist or two to put on the process. When it's all over and done, clear out the bulk_table and do it all over again. I would not mix a statistic, like hit count, into the url_visit table. This next query should be blazingly fast as both columns are part of a compound index. SELECT url_server_ID, url_path_ID, count(1) from url_visit group by url_server_ID,url_path_Id; -OR- if you would rather see full names: create temporary table tmpStats SELECT url_server_ID, url_path_ID, count(1) as hits from url_visit group by url_server_ID,url_path_Id; select concat(us.server,up.path), ts.hits from tmpStats inner join url_servers us on us.id = ts.url_server_ID inner join url_paths up on up.id = ts.url_path_ID drop table tmpStats; (NOTE: I used a temp table to aggregate on the ID values before resolving to the names because resolving those BEFORE running the GROUP BY would have eliminated the use of an existing index.) Well, I have to get back to work. Let me know how this works for you, OK? Cheers, Shawn Green Database Administrator Unimin Corporation - Spruce Pine ------------all previous responses clipped for space --------------------- -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]