Shawn,

I uncovered a problem this morning. I wonder if you (or anyone else) can help me out again?

mysql> select * from url_visit where url_scheme_ID=3 limit 10;
+---------+---------------+---------------+-------------+-------+----------+
| urlid   | url_scheme_ID | url_server_ID | url_path_ID | query | category |
+---------+---------------+---------------+-------------+-------+----------+
|   23392 |             3 |          1070 |       22221 | NULL  | none     |
| 1346269 |             3 |          1070 |       22221 | NULL  | none     |
+---------+---------------+---------------+-------------+-------+----------+
2 rows in set (0.00 sec)

This is the insert statement I'm using:

INSERT IGNORE url_visit (url_scheme_ID, url_server_ID, url_path_ID, query, category)
SELECT DISTINCT uc.ID, us.ID, up.ID,bt.query,bt.category
FROM bulk_table bt
INNER JOIN url_servers us ON us.server=bt.server
INNER JOIN url_paths up ON up.path=bt.path
INNER JOIN url_schemes uc ON uc.scheme=bt.scheme


Do I need brackets after the DISTINCT? e.g.
SELECT DISTINCT (uc.ID, us.ID, up.ID,bt.query,bt.category)
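
A note on both questions: DISTINCT applies to the whole select list, so no
brackets are needed (MySQL would reject a parenthesised multi-column row
there), and it only de-duplicates within a single INSERT ... SELECT.
INSERT IGNORE only discards duplicates that collide with a UNIQUE or
PRIMARY key, so rows loaded in separate runs can still repeat unless such a
key exists. A sketch, assuming the column names above; the index name is
arbitrary, a TEXT column can only be indexed through a prefix, and MySQL
permits multiple NULLs in a UNIQUE key, so an empty query may be better
stored as '' than NULL:

# Hypothetical unique key so INSERT IGNORE can discard repeats of the same
# scheme/server/path/query combination.
ALTER TABLE url_visit
  ADD UNIQUE KEY uniq_visit (url_scheme_ID, url_server_ID, url_path_ID, query(255));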

Thanks,

js.



Great catch! I believe you found the problem. Sorry about that!   ;-D

So... just curious... from the original 60GB of text data, how much space
did the final data require (including indices)?
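
One quick way to read those numbers off the server, as a sketch
(Data_length and Index_length in the output are reported in bytes):

SHOW TABLE STATUS;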

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine




"J S"
<[EMAIL PROTECTED] To: [EMAIL PROTECTED]
com> cc: [EMAIL PROTECTED]
Fax to:
06/22/2004 07:55 Subject: Re: load data into 2 tables and set id
AM







I think I fixed it!

INSERT IGNORE url_visit (url_server_ID, url_path_ID, querystring,
category)
SELECT DISTINCT us.ID, up.ID, if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split),
NULL),
bt.category
FROM bulk_table bt
INNER JOIN url_servers us
       ON us.server = bt.server
INNER JOIN url_paths up
       on up.path=bt.path

After this I ran the SQL to insert into internet_usage, and this time it
finished in 10m32.
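
As a quick sanity check on those runs (a sketch using the table names from
this thread): url_visit should hold one row per distinct server/path/query
combination, and internet_usage should end up with roughly one row per
bulk_table row.

SELECT COUNT(*) FROM bulk_table;
SELECT COUNT(*) FROM url_visit;
SELECT COUNT(*) FROM internet_usage;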

I'll do some more testing and let you know how it goes.

Thanks,

js.
>
>Did you mean there to be duplicates in the url_visits? Do I need to use
>IGNORE in the following SQL?
>
>INSERT url_visit (url_server_ID, url_path_ID, querystring,
>category)
>SELECT us.ID, up.ID, if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split),
>NULL),
>bt.category
>FROM bulk_table bt
>INNER JOIN url_servers us
>       ON us.server = bt.server
>INNER JOIN url_paths up
>       on up.path=bt.path
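
One way to answer the duplicates question directly is to group the visit
rows and look for repeated combinations; a sketch, assuming the column is
named query as in the output below (use querystring if that is the actual
name):

SELECT url_server_ID, url_path_ID, query, COUNT(*) AS copies
FROM url_visit
GROUP BY url_server_ID, url_path_ID, query
HAVING COUNT(*) > 1
LIMIT 10;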
>
>js.
>
>mysql> select * from url_visit where urlid=1631;
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>| urlid | url_server_ID | url_path_ID | query                                                                                                              | category     |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>|  1631 |         21720 |      630695 | cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>1 row in set (0.01 sec)
>
>mysql> select * from url_visit where urlid=1753;
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>| urlid | url_server_ID | url_path_ID | query                                                                                                              | category     |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>|  1753 |         21720 |      630695 | cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>1 row in set (0.00 sec)
>
>>
>>Shawn,
>>
>>Thanks for your reply below. I found it extremely useful. I have followed
>>your instructions and got good results up to the url_visits table.
>>
>>I have a perl script to parse the values out of the log. The log has
>>3,770,246 lines and is gzipped. I then applied your SQL statements with
>>the following times:
>>
>>1. Parse script: 10m31
>>2. Load infile: 2m39
>>3. Insert url_servers and url_paths: 5m38
>>4. Insert url_visits 9m50
>>
>>However, when I ran the SQL to insert into the internet_usage table, it just
>>ran until all the log space was filled up:
>>
>>-rw-rw----  1 mysql  mysql   379694680 Jun 22 10:05 bulk_table.MYD      <====
>>-rw-rw----  1 mysql  mysql        1024 Jun 22 10:05 bulk_table.MYI
>>-rw-rw----  1 mysql  mysql        8760 Jun 22 09:59 bulk_table.frm
>>-rw-rw----  1 mysql  mysql  2114977792 Jun 22 11:11 internet_usage.MYD  <====
>>-rw-rw----  1 mysql  mysql        1024 Jun 22 10:34 internet_usage.MYI
>>-rw-rw----  1 mysql  mysql        8646 Jun 22 09:59 internet_usage.frm
>>-rw-rw----  1 mysql  mysql    33437600 Jun 22 10:23 url_paths.MYD
>>-rw-rw----  1 mysql  mysql    27696128 Jun 22 10:23 url_paths.MYI
>>-rw-rw----  1 mysql  mysql        8574 Jun 22 09:59 url_paths.frm
>>-rw-rw----  1 mysql  mysql      646160 Jun 22 10:19 url_servers.MYD
>>-rw-rw----  1 mysql  mysql      796672 Jun 22 10:19 url_servers.MYI
>>-rw-rw----  1 mysql  mysql        8578 Jun 22 09:59 url_servers.frm
>>-rw-rw----  1 mysql  mysql   119076844 Jun 22 10:32 url_visit.MYD
>>-rw-rw----  1 mysql  mysql    73026560 Jun 22 10:33 url_visit.MYI
>>-rw-rw----  1 mysql  mysql        8694 Jun 22 09:59 url_visit.frm
>>
>>I'm sure something's not right because the internet_usage table is bigger
>>than the bulk table. I changed your SQL a bit (on the last line, with the
>>query). Could this be the problem?
>>
>>INSERT internet_usage (uid,time,ip,urlid,size)
>>SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
>>FROM bulk_table bt
>>INNER JOIN url_servers us
>> ON us.server=bt.server
>>INNER JOIN url_paths up
>> ON up.path=bt.path
>>INNER JOIN url_visit uv
>> ON uv.url_server_ID=us.ID
>> AND uv.url_path_ID=up.ID
>> AND uv.query=bt.query;
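
Two hedged observations on the statement above: if url_visit contains
duplicate server/path/query rows (no DISTINCT or unique key when it was
filled), this join multiplies every bulk_table row by the number of
duplicates, which would explain internet_usage outgrowing bulk_table; and
wherever query is NULL the plain = comparison never matches, while MySQL's
NULL-safe operator <=> treats NULL <=> NULL as true. A sketch of the same
statement with only that comparison changed:

INSERT internet_usage (uid,time,ip,urlid,size)
SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
FROM bulk_table bt
INNER JOIN url_servers us
 ON us.server=bt.server
INNER JOIN url_paths up
 ON up.path=bt.path
INNER JOIN url_visit uv
 ON uv.url_server_ID=us.ID
 AND uv.url_path_ID=up.ID
 AND uv.query <=> bt.query;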
>>
>>Thanks again,
>>
>>js.
>>
>>>
>>>-------- welcome to a basic overview of bulk importing and normalizing as
>>>you go ------------
>>>
>>>[ author's note: if you are seeing this thread for the first time and
>>>certain items seem to be introduced out of context, please review all
>>>previous posts in this thread. There has been a lot of information
>>>already exchanged on this topic that I clipped out of this response.
>>>Thanks! -- SG]
>>>
>>>In an earlier post you said
>>> >The data values I have for each record are:
>>> >
>>> >user_id date_time size url category
>>>
>>> >for example:
>>> >
>>> >u752359 2004-04-02 12:33:04 3403 http://www.mysql.com/index.html business
>>> >
>>>
>>>To me that implies that you have a standard text log where each field is
>>>separated by a space and rows are delimited by a CRLF pair. To get that
>>>log into MySQL you will need to do something _like_ (I say "like" as you
>>>will most likely need to tweak it to accommodate your actual data)
>>>
>>>CREATE TABLE bulk_table (
>>> user varchar(10)
>>> , eventdate date
>>> , eventtime time
>>> , size int
>>> , url text
>>> , category varchar(50)
>>>);
>>>
>>>LOAD DATA INFILE '<a proxy log file>' INTO TABLE bulk_table
>>>FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\r\n';
>>>
>>>Now we can start merging the logs into your data tables.
>>>
>>>To quote RFC 2396:
>>> The URI syntax does not require that the scheme-specific-part have
>>> any general structure or set of semantics which is common among all
>>> URI. However, a subset of URI do share a common syntax for
>>> representing hierarchical relationships within the namespace. This
>>> "generic URI" syntax consists of a sequence of four main components:
>>>
>>> <scheme>://<authority><path>?<query>
>>>
>>> each of which, except <scheme>, may be absent from a particular URI.
>>>
>>>That translates into 4 logical pieces we can split a URL into:
>>>1) the scheme -- HTTP, FTP, GOPHER, etc...
>>>2) the authority -- that is the server (www.yahoo.com) and any port
>>>numbers
>>>or login information
>>>3) the path -- /somefolder/somefile.whatever
>>>4) the query -- everything after the ?
>>>
>>>Not part of the generic URI is that bit known as a fragment (as
>>>identified by the #); it is ALWAYS at the end of the _entire URL_
>>>(including the query) when it is used. I have always lumped fragments
>>>and queries into the same field.
>>>
>>>So you could create tables for each of those parts and get VERY
>>>normalized, or you can partly normalize like this:
>>>
>>>ALTER TABLE bulk_table add server_split int not null default 0
>>> , add path_split int not null default 0
>>> , add server varchar(255)
>>> , add path varchar(255)
>>>
>>>UPDATE bulk_table
>>>set server_split = LOCATE('/', URL , 8)-1;
>>>
>>>UPDATE bulk_table
>>>SET path_split = if(LOCATE('?', URL, server_split)> 0, LOCATE('?', URL,
>>>server_split), LOCATE('#', URL, server_split)-1);
>>>
>>>UPDATE bulk_table
>>>set server=LEFT(URL, server_split )
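
As a worked example of where those split points land, take the sample URL
quoted earlier ('http://www.mysql.com/index.html'): LOCATE('/', URL, 8)
returns 21 (the '/' that ends the server part), so server_split becomes 20
and LEFT(URL, 20) is 'http://www.mysql.com'; LOCATE('?', ...) returns 0
because this URL carries no query string. A sketch to see the raw numbers:

SELECT LOCATE('/', 'http://www.mysql.com/index.html', 8) AS slash_pos,
       LOCATE('?', 'http://www.mysql.com/index.html', 8) AS question_pos;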
>>>
>>># Those 4 new columns helped us to parse out the 3 major parts of the URL.
>>># I added them to the table so that we would not have to keep
>>># recalculating those values later on.
>>># If it turns out that adding the columns takes a LONG time, we can create
>>># this table with those columns already created and just not import to
>>># them (change the LOAD DATA INFILE statement slightly).
>>>
>>>CREATE TABLE IF NOT EXISTS url_servers (
>>>ID int not null auto_increment,
>>>server varchar(255) primary key,
>>>Key (ID)
>>>)
>>>
>>>CREATE TABLE IF NOT EXISTS url_paths (
>>>ID int not null auto_increment,
>>>path varchar(255) primary key,
>>>Key (ID)
>>>)
>>>
>>>INSERT IGNORE INTO url_servers(server)
>>>SELECT DISTINCT server
>>>FROM bulk_table
>>>
>>>INSERT IGNORE INTO url_paths (path)
>>>SELECT DISTINCT path
>>>FROM bulk_table
>>>
>>># At this point we have all of our new Servers and our Paths uniquely
>>># numbered, but we are going to need a slightly different URL table to
>>># track visits.
>>>
>>>CREATE TABLE url_visit (
>>>urlid mediumint not null auto_increment primary key,
>>>url_server_ID int not null default 0,
>>>url_path_ID int not null default 0,
>>>querystring text default null,
>>>category varchar(50) default null,
>>>KEY(url_server_ID, url_path_ID)
>>>)
>>>
>>>## That last key is to speed up our joins to our _servers and _paths
>>>## tables...
>>>
>>># we finally have enough information to insert to the visit table
>>>
>>>INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
>>>SELECT us.ID, up.ID, if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL),
>>>bt.category
>>>FROM bulk_table bt
>>>INNER JOIN url_servers us
>>> ON us.server = bt.server
>>>INNER JOIN url_paths up
>>> on up.path=bt.path
>>>
>>>## (see where the new pre-computed columns come in handy?) :-D
>>>
>>># And now we have enough information to load the internet_usage table,
>>># though there isn't enough data in your sample source data to fill in
>>># all of the columns.
>>>
>>>INSERT internet_usage ( uid,`time`,urlid, size)
>>>SELECT bt.user, ADDTIME(bt.eventdate, bt.eventtime), uv.urlid, bt.size
>>>FROM bulk_table bt
>>>INNER JOIN url_servers us
>>> ON us.server = bt.server
>>>INNER JOIN url_paths up
>>> ON up.path=bt.path
>>>INNER JOIN url_visit uv
>>> ON uv.url_server_ID = us.ID
>>> AND uv.url_path_ID = up.id
>>> AND uv.querystring = if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split),
>>>NULL)
>>>
>>>
>>>It may not be perfect but it's how I would do it. I am sure that others
>>>out there would have a twist or two to put on the process. When it's all
>>>over and done, clear out the bulk_table and do it all over again.
>>>
>>>I would not mix a statistic, like hit count, into the url_visit table.
>>>This next query should be blazingly fast as both columns are part of a
>>>compound index.
>>>
>>>SELECT url_server_ID, url_path_ID, count(1)
>>>from url_visit
>>>group by url_server_ID,url_path_Id;
>>>
>>>-OR- if you would rather see full names:
>>>
>>>create temporary table tmpStats
>>>SELECT url_server_ID, url_path_ID, count(1) as hits
>>>from url_visit
>>>group by url_server_ID,url_path_Id;
>>>
>>>select concat(us.server, up.path), ts.hits
>>>from tmpStats ts
>>>inner join url_servers us
>>>on us.id = ts.url_server_ID
>>>inner join url_paths up
>>>on up.id = ts.url_path_ID;
>>>
>>>drop table tmpStats;
>>>
>>>(NOTE: I used a temp table to aggregate on the ID values before resolving
>>>to the names because resolving those BEFORE running the GROUP BY would
>>>have eliminated the use of an existing index.)
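
EXPLAIN can be used to confirm that reasoning on a given server: run it
against the aggregate query above, then check the key column and whether
Extra reports "Using temporary; Using filesort". A sketch:

EXPLAIN
SELECT url_server_ID, url_path_ID, COUNT(1)
FROM url_visit
GROUP BY url_server_ID, url_path_ID;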
>>>
>>>Well, I have to get back to work. Let me know how this works for you, OK?
>>>
>>>Cheers,
>>>Shawn Green
>>>Database Administrator
>>>Unimin Corporation - Spruce Pine
>>>
>>>
>>>------------all previous responses clipped for space
>>>---------------------
>>>
>>>





