I think I fixed it!

INSERT IGNORE INTO url_visit (url_server_ID, url_path_ID, querystring, category)
SELECT DISTINCT us.ID, up.ID,
       IF(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL),
       bt.category
FROM bulk_table bt
INNER JOIN url_servers us
      ON us.server = bt.server
INNER JOIN url_paths up
      ON up.path = bt.path;

After this I ran the SQL to insert into internet_usage, and this time it finished in 10m32s.

I'll do some more testing and let you know how it goes.

Thanks,

js.

Did you mean for there to be duplicates in url_visit? Do I need to use IGNORE in the following SQL?


INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
SELECT us.ID, up.ID,
       IF(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL),
       bt.category
FROM bulk_table bt
INNER JOIN url_servers us
      ON us.server = bt.server
INNER JOIN url_paths up
      ON up.path = bt.path;

js.

mysql> select * from url_visit where urlid=1631;
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
| urlid | url_server_ID | url_path_ID | query | category |
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
| 1631 | 21720 | 630695 | cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales |
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
1 row in set (0.01 sec)


mysql> select * from url_visit where urlid=1753;
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
| urlid | url_server_ID | url_path_ID | query | category |
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
| 1753 | 21720 | 630695 | cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift | Online Sales |
+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+
1 row in set (0.00 sec)



Shawn,

Thanks for your reply below; I found it extremely useful. I have followed your instructions and got good results up to the url_visit table.

I have a perl script to parse the values out of the log. The log has 3,770,246 lines and is gzipped. I then applied your SQL statements with the following times:

1. Parse script: 10m31s
2. Load infile: 2m39s
3. Insert url_servers and url_paths: 5m38s
4. Insert url_visit: 9m50s

However, when I ran the SQL to insert into the internet_usage table, it just ran until all the available disk space was filled up:

-rw-rw---- 1 mysql mysql  379694680 Jun 22 10:05 bulk_table.MYD <====
-rw-rw---- 1 mysql mysql       1024 Jun 22 10:05 bulk_table.MYI
-rw-rw---- 1 mysql mysql       8760 Jun 22 09:59 bulk_table.frm
-rw-rw---- 1 mysql mysql 2114977792 Jun 22 11:11 internet_usage.MYD <====
-rw-rw---- 1 mysql mysql       1024 Jun 22 10:34 internet_usage.MYI
-rw-rw---- 1 mysql mysql       8646 Jun 22 09:59 internet_usage.frm
-rw-rw---- 1 mysql mysql   33437600 Jun 22 10:23 url_paths.MYD
-rw-rw---- 1 mysql mysql   27696128 Jun 22 10:23 url_paths.MYI
-rw-rw---- 1 mysql mysql       8574 Jun 22 09:59 url_paths.frm
-rw-rw---- 1 mysql mysql     646160 Jun 22 10:19 url_servers.MYD
-rw-rw---- 1 mysql mysql     796672 Jun 22 10:19 url_servers.MYI
-rw-rw---- 1 mysql mysql       8578 Jun 22 09:59 url_servers.frm
-rw-rw---- 1 mysql mysql  119076844 Jun 22 10:32 url_visit.MYD
-rw-rw---- 1 mysql mysql   73026560 Jun 22 10:33 url_visit.MYI
-rw-rw---- 1 mysql mysql       8694 Jun 22 09:59 url_visit.frm


I'm sure something's not right, because the internet_usage table is bigger than the bulk table. I changed your SQL a bit (on the last line, with the query). Could this be the problem?

INSERT internet_usage (uid, time, ip, urlid, size)
SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
FROM bulk_table bt
INNER JOIN url_servers us
       ON us.server = bt.server
INNER JOIN url_paths up
       ON up.path = bt.path
INNER JOIN url_visit uv
       ON uv.url_server_ID = us.ID
       AND uv.url_path_ID = up.ID
       AND uv.query = bt.query;
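For what it's worth, a toy sketch (made-up values, not your schema) of why the row count can explode: if url_visit holds the same (server, path, query) combination more than once, every bulk_table row that matches it joins to each copy, so the insert writes multiplied rows:

```python
# Two log rows share one visit key; that key is stored twice in url_visit
# (compare urlid 1631 and 1753 in the SELECTs above). An inner join on the
# key then yields 2 x 2 output rows.
log_rows = [("u1", "k"), ("u2", "k")]          # (user, join key)
visits = [("k", 1631), ("k", 1753)]            # duplicate key, two urlids
joined = [(user, urlid)
          for (user, key) in log_rows
          for (vkey, urlid) in visits
          if key == vkey]
assert len(joined) == 4                        # 2 log rows became 4 rows
```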

Thanks again,

js.


-------- welcome to a basic overview of bulk importing and normalizing as you go ------------

[Author's note: if you are seeing this thread for the first time and certain items seem to be introduced out of context, please review all previous posts in this thread. A lot of information has already been exchanged on this topic that I clipped out of this response. Thanks! -- SG]


In an earlier post you said:
>The data values I have for each record are:
>
>user_id   date_time   size   url   category
>
>for example:
>
>u752359   2004-04-02 12:33:04   3403   http://www.mysql.com/index.html   business
>

To me that implies that you have a standard text log where each field is separated by a space and rows are delimited by a CRLF pair. To get that log into MySQL you will need to do something _like_ the following (I say "like" because you will most likely need to tweak it to accommodate your actual data):


CREATE TABLE bulk_table (
      user varchar(10)
      , eventdate date
      , eventtime time
      , size int
      , url text
      , category varchar(50)
);

LOAD DATA INFILE '<a proxy log file>' INTO TABLE bulk_table
FIELDS TERMINATED BY ' ' LINES TERMINATED BY '\r\n';


Now we can start merging the logs into your data tables.

To quote RFC 2396:
   The URI syntax does not require that the scheme-specific-part have
   any general structure or set of semantics which is common among all
   URI.  However, a subset of URI do share a common syntax for
   representing hierarchical relationships within the namespace.  This
   "generic URI" syntax consists of a sequence of four main components:

      <scheme>://<authority><path>?<query>

   each of which, except <scheme>, may be absent from a particular URI.

That translates into 4 logical pieces we can split a URL into:
1) the scheme -- HTTP, FTP, GOPHER, etc...
2) the authority -- that is the server (www.yahoo.com) and any port numbers
or login information
3) the path -- /somefolder/somefile.whatever
4) the query -- everything after the ?


Not part of the generic URI is the bit known as the fragment (identified by the #). When present, it ALWAYS comes at the end of the _entire URL_ (including the query). I have always lumped fragments and queries into the same field.
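As a sanity check, Python's standard urllib.parse.urlsplit implements the same generic-URI grammar from RFC 2396 (the sample URL here is invented):

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.mysql.com/index.html?cid=1#top")
assert parts.scheme == "http"            # 1) the scheme
assert parts.netloc == "www.mysql.com"   # 2) the authority
assert parts.path == "/index.html"       # 3) the path
assert parts.query == "cid=1"            # 4) the query (without the '?')
assert parts.fragment == "top"           # the fragment after '#'
```

Note that urlsplit strips the '?' and '#' delimiters, whereas the SQL below keeps the leading '?' in the stored querystring.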


So you could create tables for each of those parts and get VERY normalized
or you can partly normalize like this:


ALTER TABLE bulk_table
      ADD server_split int not null default 0
      , ADD path_split int not null default 0
      , ADD server varchar(255)
      , ADD path varchar(255);

UPDATE bulk_table
SET server_split = LOCATE('/', url, 8) - 1;

UPDATE bulk_table
SET path_split = IF(LOCATE('?', url, server_split) > 0,
                    LOCATE('?', url, server_split),
                    LOCATE('#', url, server_split) - 1);

UPDATE bulk_table
SET server = LEFT(url, server_split);

# the path column also needs populating before the url_paths insert below;
# something like:
UPDATE bulk_table
SET path = IF(path_split > 0,
              SUBSTRING(url, server_split + 1, path_split - server_split - 1),
              SUBSTRING(url, server_split + 1));

# Those 4 new columns helped us to parse out the 3 major parts of the url.
# I added them to the table so that we would not have to keep recalculating
# those values later on.
# If it turns out that adding the columns takes a LONG time, we can create
# this table with those columns already in place and just not import to them
# (change the LOAD DATA INFILE statement slightly).
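If it helps to check the arithmetic, here is a rough Python equivalent of those UPDATEs, using a 1-based locate() to mirror SQL's LOCATE() (the sample URL is made up):

```python
def locate(needle, hay, start=1):
    """1-based substring search, like MySQL's LOCATE(); 0 means not found."""
    return hay.find(needle, start - 1) + 1

url = "http://www.mysql.com/index.html?a=1"
server_split = locate('/', url, 8) - 1      # char just before the path's '/'
q = locate('?', url, server_split)
path_split = q if q > 0 else locate('#', url, server_split) - 1

server = url[:server_split]                 # LEFT(url, server_split)
path = (url[server_split:path_split - 1]    # SUBSTRING between the splits
        if path_split > 0 else url[server_split:])
query = url[path_split - 1:] if path_split > 0 else None

assert server == "http://www.mysql.com"
assert path == "/index.html"
assert query == "?a=1"
```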


CREATE TABLE IF NOT EXISTS url_servers (
      ID int not null auto_increment,
      server varchar(255) primary key,
      KEY (ID)
);

CREATE TABLE IF NOT EXISTS url_paths (
      ID int not null auto_increment,
      path varchar(255) primary key,
      KEY (ID)
);

INSERT IGNORE INTO url_servers (server)
SELECT DISTINCT server
FROM bulk_table;

INSERT IGNORE INTO url_paths (path)
SELECT DISTINCT path
FROM bulk_table;

# At this point we have all of our new servers and paths uniquely numbered,
# but we are going to need a slightly different URL table to track visits.
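In effect those two INSERT IGNORE statements intern each distinct string once and give it a stable ID; a minimal Python analogue of the same idea:

```python
def intern_id(lookup, value):
    """INSERT IGNORE semantics against a unique-keyed lookup table:
    known values keep their ID, new ones get the next auto_increment-style
    number."""
    if value not in lookup:
        lookup[value] = len(lookup) + 1
    return lookup[value]

servers = {}
assert intern_id(servers, "http://www.mysql.com") == 1
assert intern_id(servers, "http://www.yahoo.com") == 2
assert intern_id(servers, "http://www.mysql.com") == 1   # duplicate ignored
```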


CREATE TABLE url_visit (
      urlid mediumint not null auto_increment primary key,
      url_server_ID int not null default 0,
      url_path_ID int not null default 0,
      querystring text default null,
      category varchar(50) default null,
      KEY (url_server_ID, url_path_ID)
);

## That last key is to speed up our joins to the _servers and _paths tables.

# We finally have enough information to insert to the visit table.

INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
SELECT us.ID, up.ID,
       IF(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL),
       bt.category
FROM bulk_table bt
INNER JOIN url_servers us
      ON us.server = bt.server
INNER JOIN url_paths up
      ON up.path = bt.path;

## (see where the new pre-computed columns come in handy?) :-D

# And now we have enough information to load the internet_usage table,
# though there isn't enough data in your sample source data to fill in
# all of the columns.

INSERT internet_usage (uid, `time`, urlid, size)
SELECT bt.user, ADDTIME(bt.eventdate, bt.eventtime), uv.urlid, bt.size
FROM bulk_table bt
INNER JOIN url_servers us
      ON us.server = bt.server
INNER JOIN url_paths up
      ON up.path = bt.path
INNER JOIN url_visit uv
      ON uv.url_server_ID = us.ID
      AND uv.url_path_ID = up.ID
      AND uv.querystring <=> IF(bt.path_split > 0, SUBSTRING(bt.url, bt.path_split), NULL);


It may not be perfect but it's how I would do it. I am sure that others out
there would have a twist or two to put on the process. When it's all over
and done, clear out the bulk_table and do it all over again.


I would not mix a statistic, like hit count, into the url_visit table. This
next query should be blazingly fast as both columns are part of a compound
index.


SELECT url_server_ID, url_path_ID, COUNT(1)
FROM url_visit
GROUP BY url_server_ID, url_path_ID;

-OR- if you would rather see full names:

CREATE TEMPORARY TABLE tmpStats
SELECT url_server_ID, url_path_ID, COUNT(1) AS hits
FROM url_visit
GROUP BY url_server_ID, url_path_ID;

SELECT CONCAT(us.server, up.path), ts.hits
FROM tmpStats ts
INNER JOIN url_servers us
      ON us.ID = ts.url_server_ID
INNER JOIN url_paths up
      ON up.ID = ts.url_path_ID;

DROP TABLE tmpStats;

(NOTE: I used a temp table to aggregate on the ID values before resolving
to the names because resolving those BEFORE running the GROUP BY would have
eliminated the use of an existing index.)
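The aggregate-on-IDs-first pattern looks like this in Python terms (the IDs and names below are invented):

```python
from collections import Counter

# one (server_ID, path_ID) pair per url_visit row
visit_keys = [(1, 10), (1, 10), (2, 11)]
hits = Counter(visit_keys)                    # the GROUP BY on compact IDs

servers = {1: "http://a.example", 2: "http://b.example"}
paths = {10: "/x", 11: "/y"}

# only now resolve IDs to names, once per distinct pair
stats = {servers[s] + paths[p]: n for (s, p), n in hits.items()}
assert stats == {"http://a.example/x": 2, "http://b.example/y": 1}
```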


Well, I have to get back to work. Let me know how this works for you, OK?

Cheers,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine


------------all previous responses clipped for space ---------------------





-- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/[EMAIL PROTECTED]

















