Shawn,

I uncovered a problem this morning. I wonder if you (or anyone else) can help me out again?

mysql> select * from url_visit where url_scheme_ID=3 limit 10;
+---------+---------------+---------------+-------------+-------+----------+
| urlid   | url_scheme_ID | url_server_ID | url_path_ID | query | category |
+---------+---------------+---------------+-------------+-------+----------+
|   23392 |             3 |          1070 |       22221 | NULL  | none     |
| 1346269 |             3 |          1070 |       22221 | NULL  | none     |
+---------+---------------+---------------+-------------+-------+----------+
2 rows in set (0.00 sec)

This is the insert statement I'm using:

INSERT IGNORE url_visit (url_scheme_ID, url_server_ID, url_path_ID, query, category)
SELECT DISTINCT uc.ID, us.ID, up.ID,bt.query,bt.category
FROM bulk_table bt
INNER JOIN url_servers us ON us.server=bt.server
INNER JOIN url_paths up ON up.path=bt.path
INNER JOIN url_schemes uc ON uc.scheme=bt.scheme


Do I need brackets after the DISTINCT? e.g.
SELECT DISTINCT (uc.ID, us.ID, up.ID,bt.query,bt.category)
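
A note on both questions: DISTINCT applies to the whole select list, so no
brackets are needed (MySQL would reject a parenthesised multi-column row
there), and it only de-duplicates within a single INSERT ... SELECT.
INSERT IGNORE only discards duplicates that collide with a UNIQUE or
PRIMARY key, so rows loaded in separate runs can still repeat unless such a
key exists. A sketch, assuming the column names above; the index name is
arbitrary, a TEXT column can only be indexed through a prefix, and MySQL
permits multiple NULLs in a UNIQUE key, so an empty query may be better
stored as '' than NULL:

# Hypothetical unique key so INSERT IGNORE can discard repeats of the same
# scheme/server/path/query combination.
ALTER TABLE url_visit
  ADD UNIQUE KEY uniq_visit (url_scheme_ID, url_server_ID, url_path_ID, query(255));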

Thanks,

js.



Great catch! I believe you found the problem. Sorry about that!   ;-D

So... just curious... from the original 60GB of text data, how much space
did the final data require (including indices)?
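
One quick way to read those numbers off the server, as a sketch
(Data_length and Index_length in the output are reported in bytes):

SHOW TABLE STATUS;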

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine




"J S"
<[EMAIL PROTECTED] To: [EMAIL PROTECTED]
com> cc: [EMAIL PROTECTED]
Fax to:
06/22/2004 07:55 Subject: Re: load data into 2 tables and set id
AM







I think I fixed it!

INSERT IGNORE url_visit (url_server_ID, url_path_ID, querystring,
category)
SELECT DISTINCT us.ID, up.ID, if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split),
NULL),
bt.category
FROM bulk_table bt
INNER JOIN url_servers us
       ON us.server = bt.server
INNER JOIN url_paths up
       on up.path=bt.path

After this I ran the SQL to insert into internet_usage, and this time it
finished in 10m32.
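
As a quick sanity check on those runs (a sketch using the table names from
this thread): url_visit should hold one row per distinct server/path/query
combination, and internet_usage should end up with roughly one row per
bulk_table row.

SELECT COUNT(*) FROM bulk_table;
SELECT COUNT(*) FROM url_visit;
SELECT COUNT(*) FROM internet_usage;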

I'll do some more testing and let you know how it goes.

Thanks,

js.
>
>Did you mean there to be duplicates in the url_visits? Do I need to use
>IGNORE in the following SQL?
>
>INSERT url_visit (url_server_ID, url_path_ID, querystring,
>category)
>SELECT us.ID, up.ID, if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split),
>NULL),
>bt.category
>FROM bulk_table bt
>INNER JOIN url_servers us
>       ON us.server = bt.server
>INNER JOIN url_paths up
>       on up.path=bt.path
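
One way to answer the duplicates question directly is to group the visit
rows and look for repeated combinations; a sketch, assuming the column is
named query as in the output below (use querystring if that is the actual
name):

SELECT url_server_ID, url_path_ID, query, COUNT(*) AS copies
FROM url_visit
GROUP BY url_server_ID, url_path_ID, query
HAVING COUNT(*) > 1
LIMIT 10;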
>
>js.
>
>mysql> select * from url_visit where urlid=1631;
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>| urlid | url_server_ID | url_path_ID | query                                                                                                              | category     |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>|  1631 |         21720 |      630695 | cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>1 row in set (0.01 sec)
>
>mysql> select * from url_visit where urlid=1753;
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>| urlid | url_server_ID | url_path_ID | query                                                                                                              | category     |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>|  1753 |         21720 |      630695 | cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales |
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
>1 row in set (0.00 sec)
>
>>
>>Shawn,
>>
>>Thanks for your reply below. I found it extremely useful. I have followed
>>your instructions and got good results up to the url_visits table.
>>
>>I have a perl script to parse the values out of the log. The log has
>>3,770,246 lines and is gzipped. I then applied your SQL statements with
>>the following times:
>>
>>1. Parse script: 10m31
>>2. Load infile: 2m39
>>3. Insert url_servers and url_paths: 5m38
>>4. Insert url_visits 9m50
>>
>>However, when I ran the SQL to insert into the internet_usage table, it just
>>ran until all the log space was filled up:
>>
>>-rw-rw----  1 mysql  mysql   379694680 Jun 22 10:05 bulk_table.MYD      <====
>>-rw-rw----  1 mysql  mysql        1024 Jun 22 10:05 bulk_table.MYI
>>-rw-rw----  1 mysql  mysql        8760 Jun 22 09:59 bulk_table.frm
>>-rw-rw----  1 mysql  mysql  2114977792 Jun 22 11:11 internet_usage.MYD  <====
>>-rw-rw----  1 mysql  mysql        1024 Jun 22 10:34 internet_usage.MYI
>>-rw-rw----  1 mysql  mysql        8646 Jun 22 09:59 internet_usage.frm
>>-rw-rw----  1 mysql  mysql    33437600 Jun 22 10:23 url_paths.MYD
>>-rw-rw----  1 mysql  mysql    27696128 Jun 22 10:23 url_paths.MYI
>>-rw-rw----  1 mysql  mysql        8574 Jun 22 09:59 url_paths.frm
>>-rw-rw----  1 mysql  mysql      646160 Jun 22 10:19 url_servers.MYD
>>-rw-rw----  1 mysql  mysql      796672 Jun 22 10:19 url_servers.MYI
>>-rw-rw----  1 mysql  mysql        8578 Jun 22 09:59 url_servers.frm
>>-rw-rw----  1 mysql  mysql   119076844 Jun 22 10:32 url_visit.MYD
>>-rw-rw----  1 mysql  mysql    73026560 Jun 22 10:33 url_visit.MYI
>>-rw-rw----  1 mysql  mysql        8694 Jun 22 09:59 url_visit.frm
>>
>>I'm sure something's not right because the internet_usage table is bigger
>>than the bulk table. I changed your SQL a bit (on the last line, with the
>>query). Could this be the problem?
>>
>>INSERT internet_usage (uid,time,ip,urlid,size)
>>SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
>>FROM bulk_table bt
>>INNER JOIN url_servers us
>> ON us.server=bt.server
>>INNER JOIN url_paths up
>> ON up.path=bt.path
>>INNER JOIN url_visit uv
>> ON uv.url_server_ID=us.ID
>> AND uv.url_path_ID=up.ID
>> AND uv.query=bt.query;
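
Two hedged observations on the statement above: if url_visit contains
duplicate server/path/query rows (no DISTINCT or unique key when it was
filled), this join multiplies every bulk_table row by the number of
duplicates, which would explain internet_usage outgrowing bulk_table; and
wherever query is NULL the plain = comparison never matches, while MySQL's
NULL-safe operator <=> treats NULL <=> NULL as true. A sketch of the same
statement with only that comparison changed:

INSERT internet_usage (uid,time,ip,urlid,size)
SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
FROM bulk_table bt
INNER JOIN url_servers us
 ON us.server=bt.server
INNER JOIN url_paths up
 ON up.path=bt.path
INNER JOIN url_visit uv
 ON uv.url_server_ID=us.ID
 AND uv.url_path_ID=up.ID
 AND uv.query <=> bt.query;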
>>
>>Thanks again,
>>
>>js.
>>
>>>
>>>-------- welcome to a basic overview of bulk importing and normalizing as
>>>you go ------------
>>>
>>>[ author's note: if you are seeing this thread for the first time and
>>>certain items seem to be introduced out of context, please review all
>>>previous posts in this thread. There has been a lot of information
>>>already exchanged on this topic that I clipped out of this response.
>>>Thanks! -- SG]
>>>
>>>In an earlier post you said
>>> >The data values I have for each record are:
>>> >
>>> >user_id date_time size url category
>>>
>>> >for example:
>>> >
>>> >u752359 2004-04-02 12:33:04 3403 http://www.mysql.com/index.html business
>>> >
>>>
>>>To me that implies that you have a standard text log where each field is
>>>separated by a space and rows are delimited by a CRLF pair. To get that
>>>log into MySQL you will need to do something _like_ (I say "like" as you
>>>will most likely need to tweak it to accommodate your actual data)
>>>
>>>CREATE TABLE bulk_table (
>>> user varchar(10)
>>> , eventdate date
>>> , eventtime time
>>> , size int
>>> , url text
>>> , category varchar(50)
>>>);
>>>
>>>LOAD DATA INFILE '<a proxy log file>' INTO TABLE bulk_table
>>>FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\r\n';
>>>
>>>Now we can start merging the logs into your data tables.
>>>
>>>To quote RFC 2396:
>>> The URI syntax does not require that the scheme-specific-part have
>>> any general structure or set of semantics which is common among all
>>> URI. However, a subset of URI do share a common syntax for
>>> representing hierarchical relationships within the namespace. This
>>> "generic URI" syntax consists of a sequence of four main components:
>>>
>>> <scheme>://<authority><path>?<query>
>>>
>>> each of which, except <scheme>, may be absent from a particular URI.
>>>
>>>That translates into 4 logical pieces we can split a URL into:
>>>1) the scheme -- HTTP, FTP, GOPHER, etc...
>>>2) the authority -- that is the server (www.yahoo.com) and any port
>>>numbers
>>>or login information
>>>3) the path -- /somefolder/somefile.whatever
>>>4) the query -- everything after the ?
>>>
>>>Not part of the generic URI is that bit known as a fragment (as
>>>identified by the #); it is ALWAYS at the end of the _entire URL_
>>>(including the query) when it is used. I have always lumped fragments
>>>and queries into the same field.
>>>
>>>So you could create tables for each of those parts and get VERY
>>>normalized, or you can partly normalize like this:
>>>
>>>ALTER TABLE bulk_table add server_split int not null default 0
>>> , add path_split int not null default 0
>>> , add server varchar(255)
>>> , add path varchar(255)
>>>
>>>UPDATE bulk_table
>>>set server_split = LOCATE('/', URL , 8)-1;
>>>
>>>UPDATE bulk_table
>>>SET path_split = if(LOCATE('?', URL, server_split)> 0, LOCATE('?', URL,
>>>server_split), LOCATE('#', URL, server_split)-1);
>>>
>>>UPDATE bulk_table
>>>set server=LEFT(URL, server_split )
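
As a worked example of where those split points land, take the sample URL
quoted earlier ('http://www.mysql.com/index.html'): LOCATE('/', URL, 8)
returns 21 (the '/' that ends the server part), so server_split becomes 20
and LEFT(URL, 20) is 'http://www.mysql.com'; LOCATE('?', ...) returns 0
because this URL carries no query string. A sketch to see the raw numbers:

SELECT LOCATE('/', 'http://www.mysql.com/index.html', 8) AS slash_pos,
       LOCATE('?', 'http://www.mysql.com/index.html', 8) AS question_pos;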
>>>
>>># Those 4 new columns helped us to parse out the 3 major parts of the URL.
>>># I added them to the table so that we would not have to keep
>>># recalculating those values later on.
>>># If it turns out that adding the columns takes a LONG time, we can create
>>># this table with those columns already created and just not import to
>>># them (change the LOAD DATA INFILE statement slightly).
>>>
>>>CREATE TABLE IF NOT EXISTS url_servers (
>>>ID int not null auto_increment,
>>>server varchar(255) primary key,
>>>Key (ID)
>>>)
>>>
>>>CREATE TABLE IF NOT EXISTS url_paths (
>>>ID int not null auto_increment,
>>>path varchar(255) primary key,
>>>Key (ID)
>>>)
>>>
>>>INSERT IGNORE INTO url_servers(server)
>>>SELECT DISTINCT server
>>>FROM bulk_table
>>>
>>>INSERT IGNORE INTO url_paths (path)
>>>SELECT DISTINCT path
>>>FROM bulk_table
>>>
>>># At this point we have all of our new Servers and our Paths uniquely
>>># numbered, but we are going to need a slightly different URL table to
>>># track visits.
>>>
>>>CREATE TABLE url_visit (
>>>urlid mediumint not null auto_increment primary key,
>>>url_server_ID int not null default 0,
>>>url_path_ID int not null default 0,
>>>querystring text default null,
>>>category varchar(50) default null,
>>>KEY(url_server_ID, url_path_ID)
>>>)
>>>
>>>## That last key is to speed up our joins to our _servers and _paths
>>>## tables...
>>>
>>># we finally have enough information to insert to the visit table
>>>
>>>INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
>>>SELECT us.ID, up.ID, if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL),
>>>bt.category
>>>FROM bulk_table bt
>>>INNER JOIN url_servers us
>>> ON us.server = bt.server
>>>INNER JOIN url_paths up
>>> on up.path=bt.path
>>>
>>>## (see where the new pre-computed columns come in handy?) :-D
>>>
>>># And now we have enough information to load the internet_usage table,
>>># though there isn't enough data in your sample source data to fill in
>>># all of the columns.
>>>
>>>INSERT internet_usage ( uid,`time`,urlid, size)
>>>SELECT bt.user, ADDTIME(bt.eventdate, bt.eventtime), uv.urlid, bt.size
>>>FROM bulk_table bt
>>>INNER JOIN url_servers us
>>> ON us.server = bt.server
>>>INNER JOIN url_paths up
>>> ON up.path=bt.path
>>>INNER JOIN url_visit uv
>>> ON uv.url_server_ID = us.ID
>>> AND uv.url_path_ID = up.id
>>> AND uv.querystring = if(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split),
>>>NULL)
>>>
>>>
>>>It may not be perfect but it's how I would do it. I am sure that others
>>>out there would have a twist or two to put on the process. When it's all
>>>over and done, clear out the bulk_table and do it all over again.
>>>
>>>I would not mix a statistic, like hit count, into the url_visit table.
>>>This next query should be blazingly fast as both columns are part of a
>>>compound index.
>>>
>>>SELECT url_server_ID, url_path_ID, count(1)
>>>from url_visit
>>>group by url_server_ID,url_path_Id;
>>>
>>>-OR- if you would rather see full names:
>>>
>>>create temporary table tmpStats
>>>SELECT url_server_ID, url_path_ID, count(1) as hits
>>>from url_visit
>>>group by url_server_ID,url_path_Id;
>>>
>>>select concat(us.server, up.path), ts.hits
>>>from tmpStats ts
>>>inner join url_servers us
>>>on us.id = ts.url_server_ID
>>>inner join url_paths up
>>>on up.id = ts.url_path_ID;
>>>
>>>drop table tmpStats;
>>>
>>>(NOTE: I used a temp table to aggregate on the ID values before resolving
>>>to the names because resolving those BEFORE running the GROUP BY would
>>>have eliminated the use of an existing index.)
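
EXPLAIN can be used to confirm that reasoning on a given server: run it
against the aggregate query above, then check the key column and whether
Extra reports "Using temporary; Using filesort". A sketch:

EXPLAIN
SELECT url_server_ID, url_path_ID, COUNT(1)
FROM url_visit
GROUP BY url_server_ID, url_path_ID;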
>>>
>>>Well, I have to get back to work. Let me know how this works for you, OK?
>>>
>>>Cheers,
>>>Shawn Green
>>>Database Administrator
>>>Unimin Corporation - Spruce Pine
>>>
>>>
>>>------------all previous responses clipped for space
>>>---------------------
>>>
>>>





