Great catch! I believe you found the problem. Sorry about that! ;-D So... just curious... from the original 60GB of text data, how much space did the final data require (including indices)?
Yours,

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine


"J S" <[EMAIL PROTECTED]>          To:      [EMAIL PROTECTED]
06/22/2004 07:55 AM                Subject: Re: load data into 2 tables and set id

I think I fixed it!

INSERT IGNORE url_visit (url_server_ID, url_path_ID, querystring, category)
SELECT DISTINCT us.ID, up.ID,
       if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL),
       bt.category
FROM bulk_table bt
INNER JOIN url_servers us ON us.server = bt.server
INNER JOIN url_paths up ON up.path = bt.path;

After this I ran the SQL to insert into internet_usage, and this time it
finished after 10m32. I'll do some more testing and let you know how it goes.

Thanks,

js.

>
>Did you mean there to be duplicates in the url_visits? Do I need to use
>IGNORE in the following SQL?
>
>INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
>SELECT us.ID, up.ID,
>       if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL),
>       bt.category
>FROM bulk_table bt
>INNER JOIN url_servers us
>    ON us.server = bt.server
>INNER JOIN url_paths up
>    ON up.path = bt.path;
>
>js.
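[To see why SELECT DISTINCT is what collapses the duplicate visit rows, the pattern can be replayed in miniature using SQLite via Python's standard library (MySQL isn't assumed available here). Table and column names mirror the thread; the data row is invented, and querystring is omitted to keep the sketch short.]

```python
import sqlite3

# Replay the normalization step from the thread on an in-memory database.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE bulk_table (server TEXT, path TEXT, category TEXT);
CREATE TABLE url_servers (ID INTEGER PRIMARY KEY, server TEXT UNIQUE);
CREATE TABLE url_paths   (ID INTEGER PRIMARY KEY, path   TEXT UNIQUE);
CREATE TABLE url_visit (
    urlid INTEGER PRIMARY KEY,
    url_server_ID INTEGER,
    url_path_ID   INTEGER,
    category      TEXT
);
""")

# The same URL appears on two raw log lines:
cur.executemany("INSERT INTO bulk_table VALUES (?, ?, ?)",
                [("http://www.mysql.com", "/index.html", "business")] * 2)

# INSERT OR IGNORE is SQLite's stand-in for MySQL's INSERT IGNORE.
cur.execute("INSERT OR IGNORE INTO url_servers (server) "
            "SELECT DISTINCT server FROM bulk_table")
cur.execute("INSERT OR IGNORE INTO url_paths (path) "
            "SELECT DISTINCT path FROM bulk_table")

# Without DISTINCT, this SELECT returns one joined row per raw log line,
# so url_visit would get one row per hit; DISTINCT collapses them first.
cur.execute("""
INSERT INTO url_visit (url_server_ID, url_path_ID, category)
SELECT DISTINCT us.ID, up.ID, bt.category
FROM bulk_table bt
JOIN url_servers us ON us.server = bt.server
JOIN url_paths   up ON up.path   = bt.path
""")

visits = cur.execute("SELECT COUNT(*) FROM url_visit").fetchone()[0]
print(visits)  # 1 visit row despite 2 log lines
```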
>
>mysql> select * from url_visit where urlid=1631;
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>| urlid | url_server_ID | url_path_ID | query                                                                                                              | category     |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>|  1631 |         21720 |      630695 | cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>1 row in set (0.01 sec)
>
>mysql> select * from url_visit where urlid=1753;
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>| urlid | url_server_ID | url_path_ID | query                                                                                                              | category     |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>|  1753 |         21720 |      630695 | cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>1 row in set (0.00 sec)
>
>>
>>Shawn,
>>
>>Thanks for your reply below. I found it extremely useful. I have followed
>>your instructions and got good results up to the url_visits table.
>>
>>I have a perl script to parse the values out of the log. The log has
>>3,770,246 lines and is gzipped. I then applied your SQL statements with
>>the following times:
>>
>>1. Parse script: 10m31
>>2. Load infile: 2m39
>>3. Insert url_servers and url_paths: 5m38
>>4. Insert url_visits: 9m50
>>
>>However, when I ran the SQL to insert into the internet_usage table, it
>>just ran until all the log space was filled up:
>>
>>-rw-rw----  1 mysql mysql  379694680 Jun 22 10:05 bulk_table.MYD      <====
>>-rw-rw----  1 mysql mysql       1024 Jun 22 10:05 bulk_table.MYI
>>-rw-rw----  1 mysql mysql       8760 Jun 22 09:59 bulk_table.frm
>>-rw-rw----  1 mysql mysql 2114977792 Jun 22 11:11 internet_usage.MYD  <====
>>-rw-rw----  1 mysql mysql       1024 Jun 22 10:34 internet_usage.MYI
>>-rw-rw----  1 mysql mysql       8646 Jun 22 09:59 internet_usage.frm
>>-rw-rw----  1 mysql mysql   33437600 Jun 22 10:23 url_paths.MYD
>>-rw-rw----  1 mysql mysql   27696128 Jun 22 10:23 url_paths.MYI
>>-rw-rw----  1 mysql mysql       8574 Jun 22 09:59 url_paths.frm
>>-rw-rw----  1 mysql mysql     646160 Jun 22 10:19 url_servers.MYD
>>-rw-rw----  1 mysql mysql     796672 Jun 22 10:19 url_servers.MYI
>>-rw-rw----  1 mysql mysql       8578 Jun 22 09:59 url_servers.frm
>>-rw-rw----  1 mysql mysql  119076844 Jun 22 10:32 url_visit.MYD
>>-rw-rw----  1 mysql mysql   73026560 Jun 22 10:33 url_visit.MYI
>>-rw-rw----  1 mysql mysql       8694 Jun 22 09:59 url_visit.frm
>>
>>I'm sure something's not right, because the internet_usage table is
>>bigger than the bulk table. I changed your SQL a bit (on the last line,
>>with the query). Could this be the problem?
>>
>>INSERT internet_usage (uid, time, ip, urlid, size)
>>SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
>>FROM bulk_table bt
>>INNER JOIN url_servers us
>>    ON us.server = bt.server
>>INNER JOIN url_paths up
>>    ON up.path = bt.path
>>INNER JOIN url_visit uv
>>    ON uv.url_server_ID = us.ID
>>    AND uv.url_path_ID = up.ID
>>    AND uv.query = bt.query;
>>
>>Thanks again,
>>
>>js.
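[The runaway internet_usage table above is consistent with join fan-out: if url_visit holds the same server/path/query combination twice (as urlid 1631 and 1753 do), every matching log line joins to both copies, so the INSERT ... SELECT emits more rows than bulk_table holds. A minimal sketch in SQLite via Python, with invented data, shows the multiplication:]

```python
import sqlite3

# One raw log line plus a duplicated dimension row -> two joined rows.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE bulk_table (user TEXT, url TEXT);
CREATE TABLE url_visit (urlid INTEGER PRIMARY KEY, url TEXT);
INSERT INTO bulk_table VALUES ('u752359', 'http://www.mysql.com/index.html');
-- duplicate dimension rows, like urlid 1631 and 1753 in the thread:
INSERT INTO url_visit (url) VALUES ('http://www.mysql.com/index.html');
INSERT INTO url_visit (url) VALUES ('http://www.mysql.com/index.html');
""")

# The single log line matches BOTH visit rows, so an INSERT ... SELECT
# over this join would write two usage rows for one hit.
fanout = cur.execute("""
SELECT COUNT(*)
FROM bulk_table bt
JOIN url_visit uv ON uv.url = bt.url
""").fetchone()[0]
print(fanout)  # 2 rows from a single log line
```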
>>>
>>>-------- welcome to a basic overview of bulk importing and normalizing as you go ------------
>>>
>>>[ author's note: if you are seeing this thread for the first time and
>>>certain items seem to be introduced out of context, please review all
>>>previous posts in this thread. There has been a lot of information
>>>already exchanged on this topic that I clipped out of this response.
>>>Thanks! -- SG]
>>>
>>>In an earlier post you said:
>>>
>>> >The data values I have for each record are:
>>> >
>>> >user_id date_time size url category
>>> >
>>> >for example:
>>> >
>>> >u752359 2004-04-02 12:33:04 3403 http://www.mysql.com/index.html business
>>>
>>>To me that implies that you have a standard text log where each field is
>>>separated by a space and rows are delimited by a CRLF pair. To get that
>>>log into MySQL you will need to do something _like_ (I say "like" as you
>>>will most likely need to tweak it to accommodate your actual data):
>>>
>>>CREATE TABLE bulk_table (
>>>    user varchar(10)
>>>    , eventdate date
>>>    , eventtime time
>>>    , size int
>>>    , url text
>>>    , category varchar(50)
>>>);
>>>
>>>LOAD DATA INFILE <a proxy log file> INTO TABLE bulk_table
>>>FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\r\n';
>>>
>>>Now we can start merging the logs into your data tables.
>>>
>>>To quote RFC 2396:
>>>    The URI syntax does not require that the scheme-specific-part have
>>>    any general structure or set of semantics which is common among all
>>>    URI. However, a subset of URI do share a common syntax for
>>>    representing hierarchical relationships within the namespace. This
>>>    "generic URI" syntax consists of a sequence of four main components:
>>>
>>>        <scheme>://<authority><path>?<query>
>>>
>>>    each of which, except <scheme>, may be absent from a particular URI.
>>>
>>>That translates into 4 logical pieces we can split a URL into:
>>>1) the scheme -- HTTP, FTP, GOPHER, etc...
>>>2) the authority -- that is, the server (www.yahoo.com) and any port
>>>   numbers or login information
>>>3) the path -- /somefolder/somefile.whatever
>>>4) the query -- everything after the ?
>>>
>>>Not part of the generic URI is that bit known as a fragment (as
>>>identified by the #). It is ALWAYS at the end of the _entire URL_
>>>(including the query) when it's used. I have always lumped those and
>>>queries into the same field.
>>>
>>>So you could create tables for each of those parts and get VERY
>>>normalized, or you can partly normalize like this:
>>>
>>>ALTER TABLE bulk_table
>>>    add server_split int not null default 0
>>>    , add path_split int not null default 0
>>>    , add server varchar(255)
>>>    , add path varchar(255);
>>>
>>>UPDATE bulk_table
>>>SET server_split = LOCATE('/', url, 8) - 1;
>>>
>>>UPDATE bulk_table
>>>SET path_split = if(LOCATE('?', url, server_split) > 0,
>>>    LOCATE('?', url, server_split),
>>>    LOCATE('#', url, server_split) - 1);
>>>
>>>UPDATE bulk_table
>>>SET server = LEFT(url, server_split);
>>>
>>># those 4 new columns helped us to parse out the 3 major parts of the url
>>># I added them to the table so that we would not have to keep
>>># recalculating those values later on
>>># if it turns out that adding the columns takes a LONG time, we can
>>># create this table with those columns already created and just not
>>># import to them (change the LOAD DATA INFILE statement slightly)
>>>
>>>CREATE TABLE IF NOT EXISTS url_servers (
>>>    ID int not null auto_increment,
>>>    server varchar(255) primary key,
>>>    Key (ID)
>>>);
>>>
>>>CREATE TABLE IF NOT EXISTS url_paths (
>>>    ID int not null auto_increment,
>>>    path varchar(255) primary key,
>>>    Key (ID)
>>>);
>>>
>>>INSERT IGNORE INTO url_servers (server)
>>>SELECT DISTINCT server
>>>FROM bulk_table;
>>>
>>>INSERT IGNORE INTO url_paths (path)
>>>SELECT DISTINCT path
>>>FROM bulk_table;
>>>
>>># at this point we have all of our new Servers and our Paths uniquely
>>># numbered, but we are going to need a slightly different URL table to
>>># track visits.
>>>
>>>CREATE TABLE url_visit (
>>>    urlid mediumint not null auto_increment primary key,
>>>    url_server_ID int not null default 0,
>>>    url_path_ID int not null default 0,
>>>    querystring text default null,
>>>    category varchar(50) default null,
>>>    KEY (url_server_ID, url_path_ID)
>>>);
>>>
>>>## that last key is to speed up our joins to our _servers and _paths tables...
>>>
>>># we finally have enough information to insert to the visit table
>>>
>>>INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
>>>SELECT us.ID, up.ID,
>>>    if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL),
>>>    bt.category
>>>FROM bulk_table bt
>>>INNER JOIN url_servers us
>>>    ON us.server = bt.server
>>>INNER JOIN url_paths up
>>>    ON up.path = bt.path;
>>>
>>>## (see where the new pre-computed columns come in handy?) :-D
>>>
>>># and now we have enough information to load the internet_usage table,
>>># though there isn't enough data in your sample source data to fill in
>>># all of the columns
>>>
>>>INSERT internet_usage (uid, `time`, urlid, size)
>>>SELECT bt.user, ADDTIME(bt.eventdate, bt.eventtime), uv.urlid, bt.size
>>>FROM bulk_table bt
>>>INNER JOIN url_servers us
>>>    ON us.server = bt.server
>>>INNER JOIN url_paths up
>>>    ON up.path = bt.path
>>>INNER JOIN url_visit uv
>>>    ON uv.url_server_ID = us.ID
>>>    AND uv.url_path_ID = up.ID
>>>    AND uv.querystring = if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL);
>>>
>>>It may not be perfect, but it's how I would do it. I am sure that others
>>>out there would have a twist or two to put on the process. When it's all
>>>over and done, clear out the bulk_table and do it all over again.
>>>
>>>I would not mix a statistic, like hit count, into the url_visit table.
>>>This next query should be blazingly fast as both columns are part of a
>>>compound index.
>>>
>>>SELECT url_server_ID, url_path_ID, count(1)
>>>from url_visit
>>>group by url_server_ID, url_path_ID;
>>>
>>>-OR- if you would rather see full names:
>>>
>>>create temporary table tmpStats
>>>SELECT url_server_ID, url_path_ID, count(1) as hits
>>>from url_visit
>>>group by url_server_ID, url_path_ID;
>>>
>>>select concat(us.server, up.path), ts.hits
>>>from tmpStats ts
>>>inner join url_servers us
>>>    on us.ID = ts.url_server_ID
>>>inner join url_paths up
>>>    on up.ID = ts.url_path_ID;
>>>
>>>drop table tmpStats;
>>>
>>>(NOTE: I used a temp table to aggregate on the ID values before
>>>resolving to the names, because resolving those BEFORE running the
>>>GROUP BY would have eliminated the use of an existing index.)
>>>
>>>Well, I have to get back to work. Let me know how this works for you, OK?
>>>
>>>Cheers,
>>>Shawn Green
>>>Database Administrator
>>>Unimin Corporation - Spruce Pine
>>>
>>>------------ all previous responses clipped for space ---------------------

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
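[Editor's sketch of the temp-table trick quoted above, replayed in SQLite via Python. Names follow the thread; the three visit rows are invented. The point is the same: aggregate on the ID columns first, then resolve names in a second, small join.]

```python
import sqlite3

# Build the three tables, aggregate hits by ID into tmpStats, then
# resolve the IDs to full names in a separate join.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.executescript("""
CREATE TABLE url_servers (ID INTEGER PRIMARY KEY, server TEXT);
CREATE TABLE url_paths   (ID INTEGER PRIMARY KEY, path   TEXT);
CREATE TABLE url_visit   (url_server_ID INTEGER, url_path_ID INTEGER);
INSERT INTO url_servers VALUES (1, 'http://www.mysql.com');
INSERT INTO url_paths   VALUES (1, '/index.html');
INSERT INTO url_visit   VALUES (1, 1), (1, 1), (1, 1);

CREATE TEMPORARY TABLE tmpStats AS
SELECT url_server_ID, url_path_ID, COUNT(1) AS hits
FROM url_visit
GROUP BY url_server_ID, url_path_ID;
""")

# '||' is SQLite's string concatenation, standing in for MySQL's concat().
row = cur.execute("""
SELECT us.server || up.path, ts.hits
FROM tmpStats ts
JOIN url_servers us ON us.ID = ts.url_server_ID
JOIN url_paths   up ON up.ID = ts.url_path_ID
""").fetchone()
print(row)  # ('http://www.mysql.com/index.html', 3)
```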