Re: load data into 2 tables and set id
Hi Shawn,

I wondered if you might be able to help me with an SQL query. I want to list all the internet sites I've surfed in my database. Here's a query that matches the url with a urlid:

SELECT CONCAT(usc.scheme, '://', us.server, up.path)
FROM url_visit uv
INNER JOIN url_servers us ON us.id = uv.url_server_ID
INNER JOIN url_paths up ON up.id = uv.url_path_ID
INNER JOIN url_queries uq ON uq.id = uv.url_query_ID
INNER JOIN url_schemes usc ON usc.id = uv.url_scheme_ID
AND uv.urlid = 10023;

And that works really quickly. However, I want to run this for each urlid matching my uid in the table internet_usage.

mysql> desc internet_usage;
+-------+--------------+------+-----+---------+-------+
| Field | Type         | Null | Key | Default | Extra |
+-------+--------------+------+-----+---------+-------+
| uid   | varchar(10)  | YES  | MUL | NULL    |       |
| time  | datetime     | YES  |     | NULL    |       |
| ip    | bigint(20)   | YES  |     | 0       |       |
| urlid | mediumint(9) |      |     | 0       |       |
| size  | int(11)      | YES  |     | 0       |       |
+-------+--------------+------+-----+---------+-------+

So maybe it's something like:

SELECT iu.time, INET_NTOA(iu.ip), CONCAT(usc.scheme, '://', us.server, up.path)
FROM internet_usage us, url_visit uv
WHERE iu.uid = 'u752352'
INNER JOIN url_servers us ON us.id = uv.url_server_ID
INNER JOIN url_paths up ON up.id = uv.url_path_ID
INNER JOIN url_queries uq ON uq.id = uv.url_query_ID
INNER JOIN url_schemes usc ON usc.id = uv.url_scheme_ID

But I haven't tried this because my tables are very big and I didn't want to do the wrong join!

Thanks for any help you can offer.

js.

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
Re: load data into 2 tables and set id
Figured it out! Took a gamble and ran the below command:

SELECT iu.time, INET_NTOA(iu.ip), CONCAT(usc.scheme, '://', us.server, up.path)
FROM url_visit uv
INNER JOIN internet_usage iu ON iu.urlid = uv.urlid
INNER JOIN url_servers us ON us.id = uv.url_server_ID
INNER JOIN url_paths up ON up.id = uv.url_path_ID
INNER JOIN url_queries uq ON uq.id = uv.url_query_ID
INNER JOIN url_schemes usc ON usc.id = uv.url_scheme_ID
WHERE iu.uid = 'u752359';

-- previous responses clipped for space --
Re: load data into 2 tables and set id
Shawn,

I uncovered a problem this morning. I wonder if you (or anyone else) can help me out again?

mysql> select * from url_visit where url_scheme_ID=3 limit 10;
+---------+---------------+---------------+-------------+-------+----------+
| urlid   | url_scheme_ID | url_server_ID | url_path_ID | query | category |
+---------+---------------+---------------+-------------+-------+----------+
|   23392 |             3 |          1070 |           1 | NULL  | none     |
| 1346269 |             3 |          1070 |           1 | NULL  | none     |
+---------+---------------+---------------+-------------+-------+----------+
2 rows in set (0.00 sec)

This is the insert statement I'm using:

INSERT IGNORE url_visit (url_scheme_ID, url_server_ID, url_path_ID, query, category)
SELECT DISTINCT uc.ID, us.ID, up.ID, bt.query, bt.category
FROM bulk_table bt
INNER JOIN url_servers us ON us.server = bt.server
INNER JOIN url_paths up ON up.path = bt.path
INNER JOIN url_schemes uc ON uc.scheme = bt.scheme

Do I need brackets after the DISTINCT? e.g.

SELECT DISTINCT (uc.ID, us.ID, up.ID, bt.query, bt.category)

Thanks,

js.

-- previous responses clipped for space --
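For what it's worth, no brackets are needed: DISTINCT applies to the whole selected row, not just the first expression. A quick sketch of that behaviour, using SQLite via Python as a stand-in for MySQL (the table and values are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (a INT, b INT)")
# Two identical rows and one row that differs only in b
cur.executemany("INSERT INTO t VALUES (?, ?)", [(1, 1), (1, 1), (1, 2)])

# DISTINCT deduplicates the entire (a, b) tuple, with or without parentheses
rows = sorted(cur.execute("SELECT DISTINCT a, b FROM t").fetchall())
print(rows)  # [(1, 1), (1, 2)]
```

So `SELECT DISTINCT uc.ID, us.ID, up.ID, bt.query, bt.category` already deduplicates on the full column list.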
Re: load data into 2 tables and set id
J S,

Check to see if url_scheme_ID is part of a unique constraint/key or the Primary Key. If it isn't, you will get dupes even if you use INSERT IGNORE, unless you pre-screen your INSERTs some other way to avoid duplication.

Yours,

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

J S wrote on 06/23/2004 04:13 AM:

-- previous responses clipped for space --
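Shawn's point can be demonstrated with a small sketch. This uses SQLite's INSERT OR IGNORE (the analogue of MySQL's INSERT IGNORE) via Python; the table names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Without any UNIQUE key, "IGNORE" has no duplicate-key error to suppress,
# so the repeated insert is stored twice:
cur.execute("CREATE TABLE no_key (a INT, b INT)")
cur.execute("INSERT OR IGNORE INTO no_key VALUES (1, 2)")
cur.execute("INSERT OR IGNORE INTO no_key VALUES (1, 2)")
dupes = cur.execute("SELECT COUNT(*) FROM no_key").fetchone()[0]

# With a UNIQUE key on the column combination, the repeat is silently skipped:
cur.execute("CREATE TABLE with_key (a INT, b INT, UNIQUE (a, b))")
cur.execute("INSERT OR IGNORE INTO with_key VALUES (1, 2)")
cur.execute("INSERT OR IGNORE INTO with_key VALUES (1, 2)")
deduped = cur.execute("SELECT COUNT(*) FROM with_key").fetchone()[0]

print(dupes, deduped)  # 2 1
```

That is exactly why the INSERT IGNORE above still produced duplicate url_visit rows: nothing in the table declared the combination unique.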
Re: load data into 2 tables and set id
Hi Shawn,

Here's the url_schemes table (it's the same as the url_paths and url_servers). Does this mean url_scheme_ID is part of a unique constraint/key?

mysql> desc url_schemes;
+--------+-------------+------+-----+---------+----------------+
| Field  | Type        | Null | Key | Default | Extra          |
+--------+-------------+------+-----+---------+----------------+
| ID     | int(11)     |      | MUL | NULL    | auto_increment |
| scheme | varchar(20) |      | PRI |         |                |
+--------+-------------+------+-----+---------+----------------+
2 rows in set (0.00 sec)

mysql> desc url_visit;
+---------------+--------------+------+-----+---------+----------------+
| Field         | Type         | Null | Key | Default | Extra          |
+---------------+--------------+------+-----+---------+----------------+
| urlid         | mediumint(9) |      | PRI | NULL    | auto_increment |
| url_scheme_ID | int(11)      |      | MUL | 0       |                |
| url_server_ID | int(11)      |      |     | 0       |                |
| url_path_ID   | int(11)      |      |     | 0       |                |
| query         | text         | YES  |     | NULL    |                |
| category      | varchar(50)  | YES  |     | NULL    |                |
+---------------+--------------+------+-----+---------+----------------+
6 rows in set (0.00 sec)

mysql> select * from url_schemes;
+----+--------+
| ID | scheme |
+----+--------+
|  1 | http   |
|  2 | tcp    |
|  3 | -      |
|  4 | ftp    |
|  5 | https  |
+----+--------+
5 rows in set (0.00 sec)

-- previous responses clipped for space --
Re: load data into 2 tables and set id
No, url_scheme_ID has key type MUL, which means that multiple occurrences of a given value are allowed within the field. To prevent duplicate entries in url_visit, decide which combination of columns should have no duplicates, then add a unique index on that combination.

Michael

J S wrote:

-- previous responses clipped for space --
Re: load data into 2 tables and set id
I can't tell from a DESCRIBE output. Please post the results of:

SHOW CREATE TABLE url_visit;

Thanks!

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

J S wrote on 06/23/2004 09:57 AM:

-- previous responses clipped for space --
Re: load data into 2 tables and set id
Sorry! I'm not that experienced with databases as you probably realised!

url_visit | CREATE TABLE `url_visit` (
  `urlid` mediumint(9) NOT NULL auto_increment,
  `url_scheme_ID` int(11) NOT NULL default '0',
  `url_server_ID` int(11) NOT NULL default '0',
  `url_path_ID` int(11) NOT NULL default '0',
  `query` text,
  `category` varchar(50) default NULL,
  PRIMARY KEY (`urlid`),
  KEY `url_scheme_ID` (`url_scheme_ID`,`url_server_ID`,`url_path_ID`)
) TYPE=MyISAM

-- previous responses clipped for space --
Re: load data into 2 tables and set id
Do this to prevent duplication on those three columns in the future:

ALTER TABLE url_visit
  DROP KEY `url_scheme_ID`,
  ADD UNIQUE KEY (url_scheme_ID, url_server_ID, url_path_ID);

The way I have composed that key (the column order), it will force you to include url_scheme_ID if you want to use the index to find url_server_ID and url_path_ID. Maybe a better idea is to organize that new key so that the columns are listed in their order of prevalence in your queries: if you search by url_server_ID most often, list it first. If url_scheme_ID is not something you need as often, put it last. That changes the statement to look like:

ALTER TABLE url_visit
  DROP KEY `url_scheme_ID`,
  ADD UNIQUE KEY (url_server_ID, url_path_ID, url_scheme_ID);

This way you can search on {url_server_ID}, {url_server_ID, url_path_ID}, or {url_server_ID, url_path_ID, url_scheme_ID} and MySQL will still use the index. Because it's designated as a UNIQUE key, there will always be at most 1 record with any combination of those three values.

Yours,

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

J S wrote on 06/23/2004 10:38 AM:

-- previous responses clipped for space --
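The leftmost-prefix behaviour Shawn describes can be seen in a small sketch. This uses SQLite's EXPLAIN QUERY PLAN via Python as a stand-in for MySQL's EXPLAIN; the table and index names are made up for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# A payload column keeps the index from covering the whole query
cur.execute("CREATE TABLE t (a INT, b INT, c INT, payload TEXT)")
cur.execute("CREATE UNIQUE INDEX abc ON t (a, b, c)")

def plan(sql):
    # The 4th column of EXPLAIN QUERY PLAN output describes the access path
    return " ".join(row[3] for row in cur.execute("EXPLAIN QUERY PLAN " + sql))

prefix_plan = plan("SELECT * FROM t WHERE a = 1")     # leftmost column: index usable
no_prefix_plan = plan("SELECT * FROM t WHERE b = 1")  # middle column alone: full scan
print(prefix_plan)
print(no_prefix_plan)
```

Searching on `a` (a prefix of the index) reports `USING INDEX abc`, while searching on `b` alone falls back to a table scan, which mirrors the MySQL leftmost-prefix rule.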
Re: load data into 2 tables and set id
Shawn,

Thanks for your reply below. I found it extremely useful. I have followed your instructions and got good results up to the url_visits table. I have a perl script to parse the values out of the log. The log has 3,770,246 lines and is gzipped. I then applied your SQL statements with the following times:

1. Parse script: 10m31
2. Load infile: 2m39
3. Insert url_servers and url_paths: 5m38
4. Insert url_visits: 9m50

However when I ran the sql to insert the internet usage table, it just ran until all the log space was filled up:

-rw-rw---- 1 mysql mysql  379694680 Jun 22 10:05 bulk_table.MYD
-rw-rw---- 1 mysql mysql       1024 Jun 22 10:05 bulk_table.MYI
-rw-rw---- 1 mysql mysql       8760 Jun 22 09:59 bulk_table.frm
-rw-rw---- 1 mysql mysql 2114977792 Jun 22 11:11 internet_usage.MYD
-rw-rw---- 1 mysql mysql       1024 Jun 22 10:34 internet_usage.MYI
-rw-rw---- 1 mysql mysql       8646 Jun 22 09:59 internet_usage.frm
-rw-rw---- 1 mysql mysql   33437600 Jun 22 10:23 url_paths.MYD
-rw-rw---- 1 mysql mysql   27696128 Jun 22 10:23 url_paths.MYI
-rw-rw---- 1 mysql mysql       8574 Jun 22 09:59 url_paths.frm
-rw-rw---- 1 mysql mysql     646160 Jun 22 10:19 url_servers.MYD
-rw-rw---- 1 mysql mysql     796672 Jun 22 10:19 url_servers.MYI
-rw-rw---- 1 mysql mysql       8578 Jun 22 09:59 url_servers.frm
-rw-rw---- 1 mysql mysql  119076844 Jun 22 10:32 url_visit.MYD
-rw-rw---- 1 mysql mysql   73026560 Jun 22 10:33 url_visit.MYI
-rw-rw---- 1 mysql mysql       8694 Jun 22 09:59 url_visit.frm

I'm sure something's not right because the internet usage table is bigger than the bulk table. I changed your sql a bit (on the last line with the query). Could this be the problem?

INSERT internet_usage (uid, time, ip, urlid, size)
SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
FROM bulk_table bt
INNER JOIN url_servers us ON us.server = bt.server
INNER JOIN url_paths up ON up.path = bt.path
INNER JOIN url_visit uv ON uv.url_server_ID = us.ID
  AND uv.url_path_ID = up.ID
  AND uv.query = bt.query;

Thanks again,

js.
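The blow-up described here is consistent with a join fanning out: when the lookup table contains duplicate rows for the same key, each matching source row is emitted once per duplicate. A minimal sketch of that effect (SQLite via Python; the miniature bulk/visit tables are hypothetical stand-ins for bulk_table and url_visit):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE bulk (user TEXT, server TEXT)")
cur.execute("CREATE TABLE visit (urlid INTEGER, server TEXT)")
cur.executemany("INSERT INTO bulk VALUES (?, ?)", [("u1", "a.com"), ("u2", "a.com")])
# Two visit rows for the same server -- duplicates in the lookup table
cur.executemany("INSERT INTO visit VALUES (?, ?)", [(1, "a.com"), (2, "a.com")])

fanout = cur.execute(
    "SELECT COUNT(*) FROM bulk b INNER JOIN visit v ON v.server = b.server"
).fetchone()[0]
print(fanout)  # 4: each of the 2 bulk rows matches both duplicate visit rows
```

With millions of bulk rows and duplicated url_visit entries, that multiplication easily makes the output table larger than the input, which matches the file sizes above.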
welcome to a basic overview of bulk importing and normalizing as you go

[author's note: if you are seeing this thread for the first time and certain items seem to be introduced out of context, please review all previous posts in this thread. There has been a lot of information already exchanged on this topic that I clipped out of this response. Thanks! -- SG]

In an earlier post you said the data values you have for each record are:

user_id date_time size url category

for example:

u752359 2004-04-02 12:33:04 3403 http://www.mysql.com/index.html business

To me that implies that you have a standard text log where each field is separated by a space and rows are delimited by a CRLF pair. To get that log into MySQL you will need to do something _like_ (I say "like" as you will most likely need to tweak it to accommodate your actual data):

CREATE TABLE bulk_table (
  user varchar(10)
  , eventdate date
  , eventtime time
  , size int
  , url text
  , category varchar(50)
);

LOAD DATA INFILE '<a proxy log file>'
INTO TABLE bulk_table
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\r\n';

Now we can start merging the logs into your data tables. To quote RFC 2396:

    The URI syntax does not require that the scheme-specific-part have any general structure or set of semantics which is common among all URI. However, a subset of URI do share a common syntax for representing hierarchical relationships within the namespace. This "generic URI" syntax consists of a sequence of four main components:

        <scheme>://<authority><path>?<query>

    each of which, except <scheme>, may be absent from a particular URI.

That translates into 4 logical pieces we can split a URL into:

1) the scheme -- HTTP, FTP, GOPHER, etc...
2) the authority -- that is the server (www.yahoo.com) and any port numbers or login information
3) the path -- /somefolder/somefile.whatever
4) the query -- everything after the ?

Not part of the generic URI is that bit known as a fragment (identified by the #); it is ALWAYS at the end of the _entire URL_ (including the query) when it's used. I have always lumped fragments and queries into the same field. So you could create tables for each of those parts and get VERY normalized, or you can partly normalize like this:

ALTER TABLE bulk_table
  ADD server_split int not null default 0
  , ADD path_split int not null default 0
  , ADD server varchar(255)
  , ADD path varchar(255);

UPDATE bulk_table SET server_split = LOCATE('/', URL, 8) - 1;

UPDATE bulk_table SET path_split = IF(LOCATE('?', URL, server_split) > 0, LOCATE('?', URL, server_split), LOCATE('#', URL, server_split) - 1);

UPDATE bulk_table SET server = LEFT(URL, server_split);

# those 4 new columns helped us to parse out the 3 major parts of the url
# I added them to the table so
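The four-part split quoted from RFC 2396 above is the same decomposition that Python's standard urllib.parse performs, which makes a handy cross-check for the SQL parsing logic. A quick sketch using a URL in the shape of the example from earlier in the thread (the query string and fragment are made up):

```python
from urllib.parse import urlparse

u = urlparse("http://www.mysql.com/index.html?id=1#top")
parts = (u.scheme, u.netloc, u.path, u.query, u.fragment)
print(parts)  # ('http', 'www.mysql.com', '/index.html', 'id=1', 'top')
```

Here `netloc` is RFC 2396's "authority", and the fragment comes back separately even though, as noted above, one may choose to store it lumped together with the query.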
Re: load data into 2 tables and set id
Did you mean there to be duplicates in the url_visits? Do I need to use IGNORE in the following SQL?

INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
SELECT us.ID, up.ID, IF(bt.path_split > 0, SUBSTRING(bt.url, path), NULL), bt.category
FROM bulk_table bt
INNER JOIN url_servers us ON us.server = bt.server
INNER JOIN url_paths up ON up.path = bt.path

js.

mysql> select * from url_visit where urlid=1631\G
        urlid: 1631
url_server_ID: 21720
  url_path_ID: 630695
        query: cid=mrkbakisrc=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift
     category: Online Sales
1 row in set (0.01 sec)

mysql> select * from url_visit where urlid=1753\G
        urlid: 1753
url_server_ID: 21720
  url_path_ID: 630695
        query: cid=mrkbakisrc=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift
     category: Online Sales
1 row in set (0.00 sec)

-- previous responses clipped for space --
Re: load data into 2 tables and set id
I think I fixed it!

INSERT IGNORE url_visit (url_server_ID, url_path_ID, querystring, category)
SELECT DISTINCT us.ID, up.ID,
  IF(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL),
  bt.category
FROM bulk_table bt
INNER JOIN url_servers us ON us.server = bt.server
INNER JOIN url_paths up ON up.path = bt.path;

After this I ran the SQL to insert into internet_usage, and this time it finished after 10m32. I'll do some more testing and let you know how it goes. Thanks, js.

Did you mean there to be duplicates in url_visit? Do I need to use IGNORE in the following SQL?

INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
SELECT us.ID, up.ID,
  IF(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL),
  bt.category
FROM bulk_table bt
INNER JOIN url_servers us ON us.server = bt.server
INNER JOIN url_paths up ON up.path = bt.path;

js.

mysql> select * from url_visit where urlid=1631;
+-------+---------------+-------------+-------+--------------+
| urlid | url_server_ID | url_path_ID | query | category     |
+-------+---------------+-------------+-------+--------------+
|  1631 |         21720 |      630695 | cid=mrkbakisrc=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales |
+-------+---------------+-------------+-------+--------------+
1 row in set (0.01 sec)

mysql> select * from url_visit where urlid=1753;
+-------+---------------+-------------+-------+--------------+
| urlid | url_server_ID | url_path_ID | query | category     |
+-------+---------------+-------------+-------+--------------+
|  1753 |         21720 |      630695 | cid=mrkbakisrc=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales |
+-------+---------------+-------------+-------+--------------+
1 row in set (0.00 sec)

Shawn,

Thanks for your reply below. I found it extremely useful. I have followed your instructions and got good results up to the url_visits table. I have a perl script to parse the values out of the log. The log has 3,770,246 lines and is gzipped. I then applied your SQL statements with the following times:

1. Parse script: 10m31
2. Load infile: 2m39
3. Insert url_servers and url_paths: 5m38
4. Insert url_visits: 9m50
Re: load data into 2 tables and set id
Great catch! I believe you found the problem. Sorry about that! ;-D

So... just curious... from the original 60GB of text data, how much space did the final data require (including indices)?

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

On 06/22/2004 07:55 AM, J S wrote:
> I think I fixed it! [...]
Re: load data into 2 tables and set id
Shawn,

Here are the results:

$ ls -l /sawmill/rawlog/SG_CSGL02_main_47060821.log.gz
-rw-r-- 1 bluecoat staff 138510199 Jun 14 10:04 /sawmill/rawlog/SG_CSGL02_main_47060821.log.gz
$ time ./logfile.pl /sawmill/rawlog/SG_CSGL02_main_47060821.log.gz
Tue Jun 22 16:53:40 2004: PARSING /sawmill/rawlog/SG_CSGL02_main_47060821.log.gz...
Tue Jun 22 17:03:37 2004: BULK_TABLE_INSERT...
Tue Jun 22 17:05:41 2004: INSERT_URL_SCHEMES...
Tue Jun 22 17:06:08 2004: INSERT_URL_SERVERS...
Tue Jun 22 17:08:06 2004: INSERT_URL_PATHS...
Tue Jun 22 17:11:20 2004: INSERT_URL_VISITS...
Tue Jun 22 17:28:16 2004: INSERT_INTERNET_USAGE...
Tue Jun 22 17:42:01 2004: Finished

real 47m16.68s
user 10m44.47s
sys   0m17.95s

# ls -l
total 1206168
-rw-rw 1 mysql mysql 379692348 Jun 22 17:05 bulk_table.MYD
-rw-rw 1 mysql mysql      1024 Jun 22 17:05 bulk_table.MYI
-rw-rw 1 mysql mysql      8760 Jun 22 16:51 bulk_table.frm
-rw-rw 1 mysql mysql 115443608 Jun 22 17:40 internet_usage.MYD
-rw-rw 1 mysql mysql  19328000 Jun 22 17:42 internet_usage.MYI
-rw-rw 1 mysql mysql      8646 Jun 22 16:51 internet_usage.frm
-rw-rw 1 mysql mysql  33437600 Jun 22 17:11 url_paths.MYD
-rw-rw 1 mysql mysql  27696128 Jun 22 17:11 url_paths.MYI
-rw-rw 1 mysql mysql      8574 Jun 22 16:51 url_paths.frm
-rw-rw 1 mysql mysql        80 Jun 22 17:06 url_schemes.MYD
-rw-rw 1 mysql mysql      3072 Jun 22 17:06 url_schemes.MYI
-rw-rw 1 mysql mysql      8578 Jun 22 16:51 url_schemes.frm
-rw-rw 1 mysql mysql    646160 Jun 22 17:08 url_servers.MYD
-rw-rw 1 mysql mysql    796672 Jun 22 17:08 url_servers.MYI
-rw-rw 1 mysql mysql      8578 Jun 22 16:51 url_servers.frm
-rw-rw 1 mysql mysql  24082472 Jun 22 17:28 url_visit.MYD
-rw-rw 1 mysql mysql  16331776 Jun 22 17:28 url_visit.MYI
-rw-rw 1 mysql mysql      8736 Jun 22 16:51 url_visit.frm
# du -sk .
603088  .

I had a problem trying to get the hits results. It works through the mysql command line, but if I try to run "mysql < hits.sql > hits.txt" it can't find the temp table tmpStats.

js.
Re: load data into 2 tables and set id
Mos forgot to populate the url_id column in your user table. I would use his same process but re-arrange it like this:

1) create table BIG_TABLE
2) load data infile
3) create table URL_TABLE (
     url_id bigint not null auto_increment,
     url varchar(25) not null primary key,
     category ...
   )
4) INSERT IGNORE URL_TABLE (url, category)
   SELECT url, category FROM BIG_TABLE
5) create table USER_TABLE (
     user_id varchar(...) primary key,
     date_time datetime,
     url_id bigint,
     size int
   )
6) insert USER_TABLE
   SELECT bt.userID, bt.datetime, u.url_id, bt.size
   FROM BIG_TABLE bt
   INNER JOIN URL_TABLE u ON u.url = bt.url

Doing it this way lets the SQL engine handle the conversion of URLs to their IDs in the USER_TABLE... see? No scripting required at all!

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

Shawn,

Many thanks for your reply. It sounds like this might be quicker than my perl script, which parses a 1GB log file and fills the database in 40 minutes at the moment (the actual parsing only takes 1.5 minutes). There's one snag which I forgot about: in the url_table I have another column called hits, which is the number of hits for each url. I'm terrible at SQL and wondered if you might be able to suggest a way of doing this with the method above?

Thanks, JS.

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]
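The numbered steps above can be sketched end-to-end with Python's sqlite3 module (a hedged illustration, not the actual MySQL run: `INSERT OR IGNORE` stands in for `INSERT IGNORE`, and table/column names are shortened):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Steps 1-2: the bulk "big table" holding the raw parsed log rows.
cur.execute("CREATE TABLE big_table (user_id TEXT, url TEXT, category TEXT)")
cur.executemany("INSERT INTO big_table VALUES (?,?,?)", [
    ("u752359", "http://www.mysql.com/index.html", "business"),
    ("u752360", "http://www.mysql.com/index.html", "business"),
    ("u752359", "http://www.msn.co.uk/", "portal"),
])

# Step 3: a URL lookup table keyed on the URL itself, with a generated id.
cur.execute("""CREATE TABLE url_table (
                 url_id INTEGER PRIMARY KEY AUTOINCREMENT,
                 url TEXT NOT NULL UNIQUE,
                 category TEXT)""")

# Step 4: duplicates collapse onto one url_id each.
cur.execute("""INSERT OR IGNORE INTO url_table (url, category)
               SELECT url, category FROM big_table""")

# Steps 5-6: the join does the URL -> url_id conversion, no scripting.
cur.execute("""CREATE TABLE user_table AS
               SELECT bt.user_id, u.url_id
               FROM big_table bt JOIN url_table u ON u.url = bt.url""")

urls = cur.execute("SELECT COUNT(*) FROM url_table").fetchone()[0]
rows = cur.execute("SELECT COUNT(*) FROM user_table").fetchone()[0]
print(urls, rows)  # distinct URLs vs. usage rows
```

The key design point is step 4's insert-ignore against a unique key: it is what turns the raw log's repeated URLs into a stable id per URL before the join.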
Re: load data into 2 tables and set id
No problem!! Please post the structures of your big_table and your url_table (whatever you called them) and I will help you rewrite step 4 to count how many times a URL appears in the big_table.

Is this bulk import a process you need to do repeatedly? If so, we need to worry about updating the count column on the next batch import, not just re-creating it from the new data. That will change the query significantly.

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

On 06/13/2004 12:29 PM, J S wrote:
> [...]
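One way to handle the hits column across repeated imports, the case Shawn warns will change the query, is to stage each batch, insert unknown URLs at zero, then increment every staged URL's count by its frequency in the batch. A sqlite3 sketch with an assumed simplified schema (the real solution would be MySQL-specific SQL, so treat this only as the shape of the logic):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("""CREATE TABLE url_table (
                 url TEXT PRIMARY KEY, category TEXT, hits INTEGER)""")

def import_batch(cur, batch):
    # Stage the batch so both statements below see the same rows.
    cur.execute("CREATE TEMP TABLE IF NOT EXISTS staging (url TEXT, category TEXT)")
    cur.execute("DELETE FROM staging")
    cur.executemany("INSERT INTO staging VALUES (?,?)", batch)
    # New URLs enter with hits = 0; existing rows are left alone.
    cur.execute("""INSERT OR IGNORE INTO url_table (url, category, hits)
                   SELECT DISTINCT url, category, 0 FROM staging""")
    # Every staged URL's count grows by its occurrences in this batch,
    # so repeat imports accumulate instead of re-creating the counts.
    cur.execute("""UPDATE url_table
                   SET hits = hits + (SELECT COUNT(*) FROM staging s
                                      WHERE s.url = url_table.url)
                   WHERE url IN (SELECT url FROM staging)""")

import_batch(cur, [("http://a/", "x"), ("http://a/", "x"), ("http://b/", "y")])
import_batch(cur, [("http://a/", "x")])
hits = dict(cur.execute("SELECT url, hits FROM url_table"))
print(hits)  # http://a/ counted across both batches
```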
Re: load data into 2 tables and set id
Shawn,

Thanks for helping on this. I really appreciate it.

> No problem!! Please post the structures of your big_table and your url_table (whatever you called them) and I will help you to rewrite step 4 to count how many times a URL appears in the big_table.

mysql> desc internet_usage;
+--------+-------------+------+-----+---------+-------+
| Field  | Type        | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| uid    | varchar(10) | YES  | MUL | NULL    |       |
| time   | datetime    | YES  |     | NULL    |       |
| ip     | varchar(20) | YES  |     | NULL    |       |
| action | varchar(20) | YES  |     | NULL    |       |
| urlid  | int(11)     | YES  |     | NULL    |       |
| size   | int(11)     | YES  |     | NULL    |       |
+--------+-------------+------+-----+---------+-------+
6 rows in set (0.03 sec)

mysql> desc url_table;
+----------+--------------+------+-----+---------+----------------+
| Field    | Type         | Null | Key | Default | Extra          |
+----------+--------------+------+-----+---------+----------------+
| urlid    | mediumint(9) |      | PRI | NULL    | auto_increment |
| url      | text         | YES  | MUL | NULL    |                |
| hits     | mediumint(9) | YES  |     | NULL    |                |
| category | varchar(50)  | YES  |     | NULL    |                |
+----------+--------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)

> Is this bulk import a process you need to do repeatedly? if so we need to worry about updating the count column on the next batch import and not just re-creating it from the new data. That will change the query significantly.

Yes, it will be a repeated process. Actually I have a backlog of 6 months data to load!
Here's an example of what the data looks like:

mysql> select * from internet_usage limit 5;
+---------+---------------------+----------------+--------------+-------+------+
| uid     | time                | ip             | action       | urlid | size |
+---------+---------------------+----------------+--------------+-------+------+
| n58396  | 2004-06-07 21:12:16 | 21.38.25.204   | TCP_TUNNELED |  5999 | 5297 |
| u344584 | 2004-06-07 21:07:12 | 21.33.136.74   | TCP_HIT      |  4494 |  438 |
| -       | 2004-06-07 21:07:02 | 21.38.92.76    | TCP_NC_MISS  |  2134 |  771 |
| u524797 | 2004-06-07 21:03:27 | 21.32.25.41    | TCP_NC_MISS  |   260 |  582 |
| -       | 2004-06-07 21:09:13 | 21.201.130.240 | TCP_HIT      |  3112 |  542 |
+---------+---------------------+----------------+--------------+-------+------+

mysql> select * from url_table limit 5;
+-------+-----------------------------------------------------------------------+------+---------------+
| urlid | url                                                                   | hits | category      |
+-------+-----------------------------------------------------------------------+------+---------------+
|     1 | http://www.bbc.co.uk/horoscopes/chinesehoroscopes/images/hp-snake.gif | NULL | Entertainment |
|     2 | http://www.call18866.co.uk/images/logo.jpg                            | NULL | none          |
|     3 | http://m2.doubleclick.net/866421/0409_santoku_250.gif                 | NULL | none          |
|     4 | http://lysto1-dc02.ww.ad.ba.com/                                      | NULL | Travel        |
|     5 | http://www.aboutscotland.com/edin/newstreetph/sitview.jpg             | NULL | Travel        |
+-------+-----------------------------------------------------------------------+------+---------------+
5 rows in set (0.00 sec)

One other problem I'm having here is making the rows in internet_usage unique. At the moment I have lots of duplicates, and I was trying to create a temporary table but unfortunately got an error 27 (I think this refers to a 2GB limit):

mysql> CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM internet_usage;
ERROR 1030: Got error 27 from table handler

Is there another way of doing this?
Re: load data into 2 tables and set id
JS,

I need one more piece of information to help make that query work for you: the structure of the table that you use (or would use) to bulk import those logs into.

If you are running out of room, you may consider further normalizing your data (which saves space, but creates more data maintenance steps). I, too, have had to deal with millions of rows of internet usage logs, so I understand your pain.

You can store your IP addresses in an INT and get them back in dotted notation with the MySQL functions INET_ATON() and INET_NTOA(). That will save you an average of 10 bytes PER ROW (it adds up when you are into the millions of rows).

Create a Proxy_Action table (id tinyint, action varchar(20)) and populate it with all of the actions your proxy/firewall can log. Then replace the column "action varchar(20)" with "ProxyAction_id tinyint". (I assume there are fewer than 256 action messages available from your proxy?) That's 1 byte vs. 8 at the low end of your sample data (counting the null at the end of the string). First add the new column to the table, populate it with the ID values from your new Action table, then drop the old column.

URLs consist (in a basic sense) of the server portion (to the left of the first single /), the path portion (right of the first / and left of a ? or #), and either a fragment (after the #) or a query string (after the ?). I would at least split out the server portion into its own table. For each page request (assume 1 page and 9 pictures, all from the same server) that would be 10 rows of log data that all contain the same chunk of similar information. Reducing that heavily repeated portion of your data to an ID number will greatly help reduce the size of your database.

About the non-uniqueness of your internet_usage table: even if the same user visits the same URL multiple times (is that what you mean by repeated records?), the times should all be slightly different.
If they are not different, it is still possible that the same person requested the same page twice or more during the same second (the auto-login feature of MSIE comes to mind as one culprit). Or you could have multiple users all on the same userID hitting the same page from different machines. I guess I would have to see more data to understand your problem better.

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine

On 06/18/2004 09:40 AM, J S wrote:
> Shawn, Thanks for helping on this. I really appreciate it. [...]
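Shawn's INET_ATON()/INET_NTOA() suggestion above, sketched in Python to show the round trip: a dotted quad packs into a 4-byte unsigned integer (versus up to 15 characters as text), which is exactly the representation MySQL stores.

```python
import socket
import struct

def inet_aton_int(ip: str) -> int:
    # Network-byte-order unsigned int, matching MySQL's INET_ATON().
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def inet_ntoa_int(n: int) -> str:
    # The inverse, matching MySQL's INET_NTOA().
    return socket.inet_ntoa(struct.pack("!I", n))

# One of the sample IPs from the thread's internet_usage data.
n = inet_aton_int("21.38.25.204")
print(n, inet_ntoa_int(n))  # 354818508 21.38.25.204
```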
Re: load data into 2 tables and set id
Shawn,

Thanks for the email below - I will go through it later tonight in more detail (going home time now!) although I can already see good points there.

> JS, I need one more piece of information to help make that query work for you: the structure of the table that you use (or would use) to bulk import those logs into.

I'm completely flexible over this, as I'm still at the testing stage. You might be able to suggest the most efficient structure? Isn't the problem getting the urlid before I insert into internet_usage, so that I can find the associated url?
Re: load data into 2 tables and set id
Welcome to a basic overview of bulk importing and normalizing as you go.

[ author's note: if you are seeing this thread for the first time and certain items seem to be introduced out of context, please review all previous posts in this thread. There has been a lot of information already exchanged on this topic that I clipped out of this response. Thanks! -- SG]

In an earlier post you said:

> The data values I have for each record are: user_id date_time size url category
> for example: u752359 2004-04-02 12:33:04 3403 http://www.mysql.com/index.html business

To me that implies that you have a standard text log where each field is separated by a space and rows are delimited by a CRLF pair. To get that log into MySQL you will need to do something _like_ this (I say "like" as you will most likely need to tweak it to accommodate your actual data):

CREATE TABLE bulk_table (
  user varchar(10)
  , eventdate date
  , eventtime time
  , size int
  , url text
  , category varchar(50)
);

LOAD DATA INFILE 'a proxy log file' INTO TABLE bulk_table
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\r\n';

Now we can start merging the logs into your data tables. To quote RFC 2396:

  The URI syntax does not require that the scheme-specific-part have any general structure or set of semantics which is common among all URI. However, a subset of URI do share a common syntax for representing hierarchical relationships within the namespace. This "generic URI" syntax consists of a sequence of four main components: <scheme>://<authority><path>?<query> each of which, except <scheme>, may be absent from a particular URI.

That translates into 4 logical pieces we can split a URL into:

1) the scheme -- HTTP, FTP, GOPHER, etc...
2) the authority -- that is the server (www.yahoo.com) and any port numbers or login information
3) the path -- /somefolder/somefile.whatever
4) the query -- everything after the ?
Not part of the generic URI is that bit known as a fragment (identified by the #); it is ALWAYS at the end of the _entire URL_ (including the query) when it's used. I have always lumped those and queries into the same field. So you could create tables for each of those parts and get VERY normalized, or you can partly normalize like this:

ALTER TABLE bulk_table
  add server_split int not null default 0
  , add path_split int not null default 0
  , add server varchar(255)
  , add path varchar(255);

UPDATE bulk_table SET server_split = LOCATE('/', URL, 8) - 1;
UPDATE bulk_table SET path_split = IF(LOCATE('?', URL, server_split) > 0,
                                      LOCATE('?', URL, server_split),
                                      LOCATE('#', URL, server_split) - 1);
UPDATE bulk_table SET server = LEFT(URL, server_split);

# those 4 new columns helped us to parse out the 3 major parts of the url
# I added them to the table so that we would not have to keep recalculating those values later on
# if it turns out that adding the columns takes a LONG time, we can create this table with those columns
# already created and just not import to them (change the LOAD DATA INFILE statement slightly)

CREATE TABLE IF NOT EXISTS url_servers (
  ID int not null auto_increment,
  server varchar(255) primary key,
  KEY (ID)
);

CREATE TABLE IF NOT EXISTS url_paths (
  ID int not null auto_increment,
  path varchar(255) primary key,
  KEY (ID)
);

INSERT IGNORE INTO url_servers (server) SELECT DISTINCT server FROM bulk_table;
INSERT IGNORE INTO url_paths (path) SELECT DISTINCT path FROM bulk_table;

# at this point we have all of our new Servers and our Paths uniquely numbered
# but we are going to need a slightly different URL table to track visits.

CREATE TABLE url_visit (
  urlid mediumint not null auto_increment primary key,
  url_server_ID int not null default 0,
  url_path_ID int not null default 0,
  querystring text default null,
  category varchar(50) default null,
  KEY (url_server_ID, url_path_ID)
);

## that last key is to speed up our joins to our _servers and _paths tables...
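The LOCATE()-based splitting above can be cross-checked against a real URL parser. A Python sketch using the standard library's urllib.parse on a sample URL shows the same four-way split (scheme, authority, path, query) plus the fragment the overview mentions:

```python
from urllib.parse import urlsplit

# urlsplit() performs the RFC-style decomposition the SQL approximates
# with LOCATE('/', ...), LOCATE('?', ...) and LOCATE('#', ...).
parts = urlsplit("http://www.mysql.com/index.html?a=1#frag")
print(parts.scheme, parts.netloc, parts.path, parts.query, parts.fragment)
# -> http www.mysql.com /index.html a=1 frag
```

This is handy for spot-checking the computed server_split/path_split positions against a known-good parse before running the updates over millions of rows.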
# we finally have enough information to insert to the visit table

INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
SELECT us.ID, up.ID,
  IF(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL),
  bt.category
FROM bulk_table bt
INNER JOIN url_servers us ON us.server = bt.server
INNER JOIN url_paths up ON up.path = bt.path;

## (see where the new pre-computed columns come in handy?) :-D

# and now we have enough information to load the internet_usage table. Though there isn't enough data in your sample
# source data to fill in all of the columns

INSERT internet_usage (uid, `time`, urlid, size)
SELECT bt.user, ADDTIME(bt.eventdate, bt.eventtime), uv.urlid, bt.size
FROM bulk_table bt
INNER JOIN url_servers us ON us.server = bt.server
INNER JOIN url_paths up ON up.path = bt.path
INNER JOIN url_visit uv ON uv.url_server_ID = us.ID
  AND uv.url_path_ID = up.ID
  AND uv.querystring = IF(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL);

It may not be perfect but it's how I would do it.
Re: load data into 2 tables and set id
Hi, I need some help please! I have 60GB of proxy logs to parse and load into a mysql database. I've written a parsing script, but I'm stuck now on how to load the data in. I have a database called PROXY_LOG with 2 tables:

USER_TABLE: user_id, date_time, url_id, size
URL_TABLE: url_id, url, category

The data values I have for each record are: user_id date_time size url category

for example: u752359 2004-04-02 12:33:04 3403 http://www.mysql.com/index.html business

The problem is I'm not sure how to load the data into the 2 tables and set the url_id. Could anyone point me in the right direction please? Many thanks, js.

js,

Is this a one time operation? If so, you could create a temporary table with all the fields:

BIG_TABLE: user_id, date_time, url_id, size, url, category

and use LOAD DATA INFILE to load the data into this one table. Then create the two smaller tables using:

create table USER_TABLE select user_id, date_time, url_id, size from BIG_TABLE;
create table URL_TABLE select url_id, url, category from BIG_TABLE;
drop table BIG_TABLE;

So there you have it in 3 lines of code. :-)

Mike

P.S. If you could break the text file into two files to begin with, then two LOAD DATA INFILE statements would handle things nicely.

Thanks Mos, that sounds ideal but the url_id value has to come from the database.

js.
Re: load data into 2 tables and set id
Mos forgot to populate the url_id column in your user table. I would use his same process, but rearrange it like this:

1) create table BIG_TABLE
2) load data infile
3) create table URL_TABLE ( url_id bigint not null auto_increment, url varchar(25) not null primary key, category ... )
4) insert ignore into URL_TABLE (url, category) select url, category from BIG_TABLE;
5) create table USER_TABLE ( user_id varchar(...) primary key, date_time datetime, url_id bigint, size int )
6) insert into USER_TABLE select bt.user_id, bt.date_time, u.url_id, bt.size from BIG_TABLE bt inner join URL_TABLE u on u.url = bt.url;

Doing it this way lets the SQL engine handle the conversion of URLs to their IDs in the USER_TABLE. See? No scripting required at all!

Yours,

Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine
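Shawn's six steps could be written out as one script, roughly as follows. This is only a sketch: the column types, varchar lengths, file path, and field delimiter are assumptions (the thread never specifies them), and the URL_TABLE keys are arranged so that url_id carries AUTO_INCREMENT as the primary key while a separate UNIQUE key on url makes INSERT IGNORE collapse duplicate URLs.

```sql
-- 1) Staging table holding every parsed field (types are guesses).
CREATE TABLE BIG_TABLE (
    user_id   VARCHAR(10),
    date_time DATETIME,
    size      INT,
    url       VARCHAR(255),
    category  VARCHAR(50)
);

-- 2) Bulk-load the parsed log (tab-separated file path is hypothetical).
LOAD DATA INFILE '/tmp/proxy_log.tsv' INTO TABLE BIG_TABLE;

-- 3) URL table: the UNIQUE key on url is what lets INSERT IGNORE
--    keep exactly one row, and one generated url_id, per distinct URL.
CREATE TABLE URL_TABLE (
    url_id   BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    url      VARCHAR(255) NOT NULL,
    category VARCHAR(50),
    UNIQUE KEY (url)
);

-- 4) Duplicate URLs are silently skipped by INSERT IGNORE.
INSERT IGNORE INTO URL_TABLE (url, category)
SELECT url, category FROM BIG_TABLE;

-- 5) Final per-visit table.
CREATE TABLE USER_TABLE (
    user_id   VARCHAR(10),
    date_time DATETIME,
    url_id    BIGINT,
    size      INT
);

-- 6) The join translates each staged row's url into its url_id.
INSERT INTO USER_TABLE (user_id, date_time, url_id, size)
SELECT bt.user_id, bt.date_time, u.url_id, bt.size
FROM BIG_TABLE bt
INNER JOIN URL_TABLE u ON u.url = bt.url;

DROP TABLE BIG_TABLE;
```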
Re: load data into 2 tables and set id
What language did you use to do the parsing? If it was Perl, I'd recommend looking at using Perl's DBI interface and doing it programmatically. Do you have any idea how many different URLs you have?

Basic steps: get a record; check to see if the URL is in the database; if it is, get the url_id; if not, insert it and get the generated url_id. Insert the user record using the url_id you now have. Repeat until you run out of records. Not elegant, but it will get the job done.

Note: look in the documentation for how to get the new url_id after you do an insert. It's in DBD::mysql for Perl.

--
William R. Mussatto, Senior Systems Engineer
Ph. 909-920-9154 ext. 27
FAX. 909-608-7061
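The per-record loop William describes might look like this on the SQL side. The values come from the example record in the original post; the url_id 42 in the final insert is a placeholder for whichever id the lookup or insert produced, and the UNIQUE index on url is an assumption.

```sql
-- Look up the URL first (assumes a UNIQUE index on url).
SELECT url_id FROM URL_TABLE
WHERE url = 'http://www.mysql.com/index.html';

-- If no row came back, insert it and read the generated id.
INSERT INTO URL_TABLE (url, category)
VALUES ('http://www.mysql.com/index.html', 'business');
SELECT LAST_INSERT_ID();   -- from Perl DBI: $dbh->{mysql_insertid}

-- Then write the visit row using whichever url_id you obtained
-- (42 here is a placeholder, not a real id).
INSERT INTO USER_TABLE (user_id, date_time, url_id, size)
VALUES ('u752359', '2004-04-02 12:33:04', 42, 3403);
```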
Re: load data into 2 tables and set id
Thanks for your reply, William. I am using a Perl script. If I have to insert these records one by one, it's going to be really slow, isn't it? Maybe the quickest way is to parse the logs twice, i.e. fill the URL_TABLE first using your procedure above, then on the second run create a LOAD file for USER_TABLE?

js.
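If the second pass writes out tab-separated rows that already contain the resolved url_id, the final load is a single statement. A sketch, with the file name, delimiter, and field order as assumptions:

```sql
-- Bulk-load the second-pass output straight into USER_TABLE.
LOAD DATA INFILE '/tmp/user_table.tsv'
INTO TABLE USER_TABLE
FIELDS TERMINATED BY '\t'
(user_id, date_time, url_id, size);
```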
Re: load data into 2 tables and set id
How will you get the information for the url_ids? I can see splitting the logs and using a load file for URL_TABLE (if you can eliminate duplicates). You can save some time if you can build a Perl hash with $url{urlvalue} = url_id. Test that, and only do inserts when you need to. A hash lookup is faster than a database query, but you will have to hold the hash in memory.
You can use the hash to prepare the USER_TABLE load file and then LOAD DATA INFILE that. Just a thought: url is going to have to be a unique key? You can speed up the initial inserts by inserting without that key (using the Perl hash to avoid collisions) and then altering the table to add the key afterwards. However, the question comes back to: do you have enough memory for the hash in Perl? Notice also that you don't have a rowID equivalent in the USER_TABLE.
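William's "insert without the key, add it later" trick might be sketched like this; the file name is hypothetical, and it assumes the Perl hash has already guaranteed that no duplicate URLs reach the load file.

```sql
-- Bulk-load URL_TABLE with no UNIQUE index in place; index maintenance
-- is skipped during the load because the index does not exist yet.
LOAD DATA INFILE '/tmp/url_table.tsv' INTO TABLE URL_TABLE
(url_id, url, category);

-- Add the unique key once, after the bulk load finishes.
ALTER TABLE URL_TABLE ADD UNIQUE KEY (url);
```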