J S, Check to see if the url_scheme_ID part of a unique constraint/key or the Primary Key? If it isn't, you will get dupes even if you use INSERT IGNORE unless you pre-screen your INSERTs some other way to avoid duplication.
Yours, Shawn Green Database Administrator Unimin Corporation - Spruce Pine "J S" <[EMAIL PROTECTED] To: [EMAIL PROTECTED] com> cc: [EMAIL PROTECTED] Fax to: 06/23/2004 04:13 Subject: Re: load data into 2 tables and set id AM Shawn, I uncovered a problem this morning. I wonder if you (or anyone else) can help me out again? mysql> select * from url_visit where url_scheme_ID=3 limit 10; +---------+---------------+---------------+-------------+-------+----------+ | urlid | url_scheme_ID | url_server_ID | url_path_ID | query | category | +---------+---------------+---------------+-------------+-------+----------+ | 23392 | 3 | 1070 | 22221 | NULL | none | | 1346269 | 3 | 1070 | 22221 | NULL | none | +---------+---------------+---------------+-------------+-------+----------+ 2 rows in set (0.00 sec) This is the insert statement I'm using: INSERT IGNORE url_visit (url_scheme_ID, url_server_ID, url_path_ID, query, category) SELECT DISTINCT uc.ID, us.ID, up.ID,bt.query,bt.category FROM bulk_table bt INNER JOIN url_servers us ON us.server=bt.server INNER JOIN url_paths up ON up.path=bt.path INNER JOIN url_schemes uc ON uc.scheme=bt.scheme Do I need brackets after the distinct? e.g SELECT DISTINCT (uc.ID, us.ID, up.ID,bt.query,bt.category) Thanks, js. > > >Great catch! I believe you found the problem. Sorry about that! ;-D > >So... just curious... from the original 60GB of text data, how much space >did the final data require (including indices)? > >Yours, >Shawn Green >Database Administrator >Unimin Corporation - Spruce Pine > > > > > "J S" > <[EMAIL PROTECTED] To: [EMAIL PROTECTED] > com> cc: >[EMAIL PROTECTED] > Fax to: > 06/22/2004 07:55 Subject: Re: load data >into 2 tables and set id > AM > > > > > > >I think I fixed it! > >INSERT IGNORE url_visit (url_server_ID, url_path_ID, querystring, >category) >SELECT DISTINCT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path), >NULL), >bt.category >FROM bulk_table bt >INNER JOIN url_servers us > ON us.server = bt.server >INNER JOIN url_paths up > on up.path=bt.path > >After this I ran the sql to insert into internet_usage and this time it >finished after 10m32. > >I'll do some more testing and let you know how it goes. > >Thanks, > >js. > > > >Did you mean there to be duplicates in the url_visits? Do I need to use > >IGNORE in the following SQL? > > > >INSERT url_visit (url_server_ID, url_path_ID, querystring, > >category) > >SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path), > >NULL), > >bt.category > >FROM bulk_table bt > >INNER JOIN url_servers us > > ON us.server = bt.server > >INNER JOIN url_paths up > > on up.path=bt.path > > > >js. > > > >mysql> select * from url_visit where urlid=1631; > >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ > > >| urlid | url_server_ID | url_path_ID | query > > > > > > | category | > >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ > > >| 1631 | 21720 | 630695 | > >cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift > > >| Online Sales | > >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ > > >1 row in set (0.01 sec) > > > >mysql> select * from url_visit where urlid=1753; > >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ > > >| urlid | url_server_ID | url_path_ID | query > > > > > > | category | > >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ > > >| 1753 | 21720 | 630695 | > >cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift > > >| Online Sales | > >+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+ > > >1 row in set (0.00 sec) > > > >> > >>Shawn, > >> > >>Thanks for your reply below. I found it extremely useful. I have >followed > > >>your instructions and got good results up to the url_visits table. > >> > >>I have a perl script to parse the values out of the log. The log has > >>3,770,246 lines and is gzipped. I then applied your SQL statements with > >>the following times: > >> > >>1. Parse script: 10m31 > >>2. Load infile: 2m39 > >>3. Insert url_servers and url_paths: 5m38 > >>4. Insert url_visits 9m50 > >> > >>However when I ran the sql to insert the internet usage table, it just >ran > >>until all the log space was filled up: > >> > >>-rw-rw---- 1 mysql mysql 379694680 Jun 22 10:05 bulk_table.MYD > > >> <==== > >>-rw-rw---- 1 mysql mysql 1024 Jun 22 10:05 bulk_table.MYI > >>-rw-rw---- 1 mysql mysql 8760 Jun 22 09:59 bulk_table.frm > >>-rw-rw---- 1 mysql mysql 2114977792 Jun 22 11:11 > >>internet_usage.MYD <==== > >>-rw-rw---- 1 mysql mysql 1024 Jun 22 10:34 internet_usage.MYI > >>-rw-rw---- 1 mysql mysql 8646 Jun 22 09:59 internet_usage.frm > >>-rw-rw---- 1 mysql mysql 33437600 Jun 22 10:23 url_paths.MYD > >>-rw-rw---- 1 mysql mysql 27696128 Jun 22 10:23 url_paths.MYI > >>-rw-rw---- 1 mysql mysql 8574 Jun 22 09:59 url_paths.frm > >>-rw-rw---- 1 mysql mysql 646160 Jun 22 10:19 url_servers.MYD > >>-rw-rw---- 1 mysql mysql 796672 Jun 22 10:19 url_servers.MYI > >>-rw-rw---- 1 mysql mysql 8578 Jun 22 09:59 url_servers.frm > >>-rw-rw---- 1 mysql mysql 119076844 Jun 22 10:32 url_visit.MYD > >>-rw-rw---- 1 mysql mysql 73026560 Jun 22 10:33 url_visit.MYI > >>-rw-rw---- 1 mysql mysql 8694 Jun 22 09:59 url_visit.frm > >> > >>I'm sure something's not right because the internet Usage table is >bigger > > >>than the bulk table. I changed your sql a bit (on the last line with the > >>query). Could this be the problem? > >> > >>INSERT internet_usage (uid,time,ip,urlid,size) > >>SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size > >>FROM bulk_table bt > >>INNER JOIN url_servers us > >> ON us.server=bt.server > >>INNER JOIN url_paths up > >> ON up.path=bt.path > >>INNER JOIN url_visit uv > >> ON uv.url_server_ID=us.ID > >> AND uv.url_path_ID=up.ID > >> AND uv.query=bt.query; > >> > >>Thanks again, > >> > >>js. > >> > >>> > >>>-------- welcome to a basic overview of bulk importing and normalizing >as > >>>you go ------------ > >>> > >>>[ author's note: if you are seeing this thread for the first time and > >>>certain items seem to be > >>>introduced out of context, please review all previous posts in this > >>>thread. > >>>There has been > >>>a lot of information already exchanged on this topic that I clipped > >>>out of this response. Thanks! -- SG] > >>> > >>>In an earlier post you said > >>> >The data values I have for each record are: > >>> > > >>> >user_id date_time size url category > >>> > >>> >for example: > >>> > > >>> >u752359 2004-04-02 12:33:04 3403 >http://www.mysql.com/index.html > >>>business > >>> > > >>> > >>>To me that implies that you have a standard text log where each field >is > >>>separated by a space and rows are delimited by a CRLF pair. To get that > >>>log > >>>into MySQL you will need to do something _like_ (I say "like" as you >will > >>>most likely need to tweak it to accommodate your actual data) > >>> > >>>CREATE TABLE bulk_table ( > >>> user varchar(10) > >>> , eventdate date > >>> , eventtime time > >>> , size int > >>> , url text > >>> , category varchar(50) > >>>); > >>> > >>>LOAD DATA INFILE <a proxy log file> INTO bulk_table FIELDS TERMINATED >BY > > >>>' > >>>' LINES TERMINATED BY '\r\n'; > >>> > >>>Now we can start merging the logs into your data tables. > >>> > >>>To quote RFC 2396: > >>> The URI syntax does not require that the scheme-specific-part have > >>> any general structure or set of semantics which is common among all > >>> URI. However, a subset of URI do share a common syntax for > >>> representing hierarchical relationships within the namespace. This > >>> "generic URI" syntax consists of a sequence of four main >components: > >>> > >>> <scheme>://<authority><path>?<query> > >>> > >>> each of which, except <scheme>, may be absent from a particular >URI. > >>> > >>>That translates into 4 logical pieces we can split a URL into: > >>>1) the scheme -- HTTP, FTP, GOPHER, etc... > >>>2) the authority -- that is the server (www.yahoo.com) and any port > >>>numbers > >>>or login information > >>>3) the path -- /somefolder/somefile.whatever > >>>4) the query -- everything after the ? > >>> > >>>Not part of the generic URI is that bit known as a fragment (as > >>>identified > >>>by the #) it is ALWAYS at the end of the _entire URL_ (including the > >>>query) > >>>when it's used. I have always lumped those and queries into the same > >>>field. > >>> > >>>So you could create tables for each of those parts and get VERY > >>>normalized > >>>or you can partly normalize like this: > >>> > >>>ALTER TABLE bulk_table add server_split int not null default 0 > >>> , add path_split int not null default 0 > >>> , add server varchar(255) > >>> , add path varchar(255) > >>> > >>>UPDATE bulk_table > >>>set server_split = LOCATE('/', URL , 8)-1; > >>> > >>>UPDATE bulk_table > >>>SET path_split = if(LOCATE('?', URL, server_split)> 0, LOCATE('?', URL, > >>>server_split), LOCATE('#', URL, server_split)-1); > >>> > >>>UPDATE bulk_table > >>>set server=LEFT(URL, server_split ) > >>> > >>>#those 4 new columns helped us to parse out the 3 major parts of the >url > >>>#I added them to the table so that we would not have to keep > >>>recalculating > >>>those values later on > >>># if it turns out that adding the columns takes a LONG time, we can > >>>create > >>>this table with those columns > >>># already created and just not import to them (change the LOAD DATA > >>>INFILE > >>>statement slightly) > >>> > >>>CREATE TABLE IF NOT EXISTS url_servers ( > >>>ID int not null auto_increment, > >>>server varchar(255) primary key, > >>>Key (ID) > >>>) > >>> > >>>CREATE TABLE IF NOT EXISTS url_paths ( > >>>ID int not null auto_increment, > >>>path varchar(255) primary key, > >>>Key (ID) > >>>) > >>> > >>>INSERT IGNORE INTO url_servers(server) > >>>SELECT DISTINCT server > >>>FROM bulk_table > >>> > >>>INSERT IGNORE INTO url_paths (path) > >>>SELECT DISTINCT path > >>>FROM bulk_table > >>> > >>># at this point we have all of our new Servers and our Paths uniquely > >>>numbered > >>># but we are going to need a slightly different URL table to track > >>>visits. > >>> > >>>CREATE TABLE url_visit ( > >>>urlid mediumint not null auto_increment primary key, > >>>url_server_ID int not null default 0, > >>>url_path_ID int not null default 0, > >>>querystring text default null, > >>>category varchar(50)default null, > >>>KEY(url_server_ID, url_path_ID) > >>>) > >>> > >>>## that last key is to speed up our joins to our _servers and _paths > >>>tables... > >>> > >>># we finally have enough information to insert to the visit table > >>> > >>>INSERT url_visit (url_server_ID, url_path_ID, querystring, category) > >>>SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path), >NULL), > >>>bt.category > >>>FROM bulk_table bt > >>>INNER JOIN url_servers us > >>> ON us.server = bt.server > >>>INNER JOIN url_paths up > >>> on up.path=bt.path > >>> > >>>## (see where the new pre-computed columns come in handy?) :-D > >>> > >>># and now we have enough information to load the internet_usage table. > >>>Though there isn't enough data in your sample > >>># source data to fill in all of the columns > >>> > >>>INSERT internet_usage ( uid,`time`,urlid, size) > >>>SELECT bt.user, ADDTIME(bt.date, bt.time), uv.urlid, bt.size > >>>FROM bulk_table bt > >>>INNER JOIN url_servers us > >>> ON us.server = bt.server > >>>INNER JOIN url_paths up > >>> ON up.path=bt.path > >>>INNER JOIN url_visit uv > >>> ON uv.url_server_ID = us.ID > >>> AND uv.url_path_ID = up.id > >>> AND uv.querystring = if(bt.path_split >0, >SUBSTRING(bt.url,path), > >>>NULL) > >>> > >>> > >>>It may not be perfect but it's how I would do it. I am sure that others > >>>out > >>>there would have a twist or two to put on the process. When it's all > >>>over > >>>and done, clear out the bulk_table and do it all over again. > >>> > >>>I would not mix a statistic, like hit count, into the url_visit table. > >>>This > >>>next query should be blazingly fast as both columns are part of a > >>>compound > >>>index. > >>> > >>>SELECT url_server_ID, url_path_ID, count(1) > >>>from url_visit > >>>group by url_server_ID,url_path_Id; > >>> > >>>-OR- if you would rather see full names: > >>> > >>>create temporary table tmpStats > >>>SELECT url_server_ID, url_path_ID, count(1) as hits > >>>from url_visit > >>>group by url_server_ID,url_path_Id; > >>> > >>>select concat(us.server,up.path), ts.hits > >>>from tmpStats > >>>inner join url_servers us > >>>on us.id = ts.url_server_ID > >>>inner join url_paths up > >>>on up.id = ts.url_path_ID > >>> > >>>drop table tmpStats; > >>> > >>>(NOTE: I used a temp table to aggregate on the ID values before >resolving > >>>to the names because resolving those BEFORE running the GROUP BY would > >>>have > >>>eliminated the use of an existing index.) > >>> > >>>Well, I have to get back to work. Let me know how this works for you, >OK? > >>> > >>>Cheers, > >>>Shawn Green > >>>Database Administrator > >>>Unimin Corporation - Spruce Pine > >>> > >>> > >>>------------all previous responses clipped for space > >>>--------------------- > >>> > >>> > > > > > >-- >MySQL General Mailing List >For list archives: http://lists.mysql.com/mysql >To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED] > _________________________________________________________________ Want to block unwanted pop-ups? Download the free MSN Toolbar now! http://toolbar.msn.co.uk/ -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]