Re: load data into 2 tables and set id

SGreen Fri, 18 Jun 2004 11:21:53 -0700

-------- welcome to a basic overview of bulk importing and normalizing as
you go ------------


[ author's note: if you are seeing this thread for the first time and
certain items seem to be
introduced out of context, please review all previous posts in this thread.
There has been
a lot of information already exchanged on this topic that I clipped
out of this response. Thanks! -- SG]

In an earlier post you said
>The data values I have for each record are:
>
>user_id   date_time   size   url   category

>for example:
>
>u752359   2004-04-02 12:33:04   3403   http://www.mysql.com/index.html
business
>

To me that implies that you have a standard text log where each field is
separated by a space and rows are delimited by a CRLF pair. To get that log
into MySQL you will need to do something _like_ (I say "like" as you will
most likely need to tweak it to accommodate your actual data)

CREATE TABLE bulk_table (
      user varchar(10)
      , eventdate date
      , eventtime time
      , size int
      , url text
      , category varchar(50)
);

LOAD DATA INFILE <a proxy log file> INTO bulk_table FIELDS TERMINATED BY '
' LINES TERMINATED BY '\r\n';

Now we can start merging the logs into your data tables.

To quote RFC 2396:
   The URI syntax does not require that the scheme-specific-part have
   any general structure or set of semantics which is common among all
   URI.  However, a subset of URI do share a common syntax for
   representing hierarchical relationships within the namespace.  This
   "generic URI" syntax consists of a sequence of four main components:

      <scheme>://<authority><path>?<query>

   each of which, except <scheme>, may be absent from a particular URI.

That translates into 4 logical pieces we can split a URL into:
1) the scheme -- HTTP, FTP, GOPHER, etc...
2) the authority -- that is the server (www.yahoo.com) and any port numbers
or login information
3) the path -- /somefolder/somefile.whatever
4) the query -- everything after the ?

Not part of the generic URI is that bit known as a fragment (as identified
by the #) it is ALWAYS at the end of the _entire URL_ (including the query)
when it's used. I have always lumped those and queries into the same field.

So you could create tables for each of those parts and get VERY normalized
or you can partly normalize like this:

ALTER TABLE bulk_table add server_split int not null default 0
      , add path_split int not null default 0
      , add server varchar(255)
      , add path varchar(255)

UPDATE bulk_table
set server_split = LOCATE('/', URL , 8)-1;

UPDATE bulk_table
SET path_split = if(LOCATE('?', URL, server_split)> 0, LOCATE('?', URL,
server_split),  LOCATE('#', URL, server_split)-1);

UPDATE bulk_table
set server=LEFT(URL, server_split )

#those 4 new columns helped us to parse out the 3 major parts of the url
#I added them to the table so that we would not have to keep recalculating
those values later on
# if it turns out that adding the columns takes a LONG time, we can create
this table with those columns
# already created and just not import to them (change the LOAD DATA INFILE
statement slightly)

CREATE TABLE IF NOT EXISTS url_servers (
ID int not null auto_increment,
server varchar(255) primary key,
Key (ID)
)

CREATE TABLE IF NOT EXISTS url_paths (
ID int not null auto_increment,
path varchar(255) primary key,
Key (ID)
)

INSERT IGNORE INTO url_servers(server)
SELECT DISTINCT server
FROM bulk_table

INSERT IGNORE INTO url_paths (path)
SELECT DISTINCT path
FROM bulk_table

# at this point we have all of our new Servers and our Paths uniquely
numbered
# but we are going to need a slightly different URL table to track visits.

CREATE TABLE url_visit (
urlid mediumint not null auto_increment primary key,
url_server_ID int not null default 0,
url_path_ID int not null default 0,
querystring text default null,
category varchar(50)default null,
KEY(url_server_ID, url_path_ID)
)

## that last key is to speed up our joins to our _servers and _paths
tables...

#  we finally have enough information to insert to the visit table

INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path), NULL),
bt.category
FROM bulk_table bt
INNER JOIN url_servers us
      ON us.server = bt.server
INNER JOIN url_paths up
      on up.path=bt.path

## (see where the new pre-computed columns come in handy?) :-D

# and now we have enough information to load the internet_usage table.
Though there isn't enough data in your sample
# source data to fill in all of the columns

INSERT internet_usage ( uid,`time`,urlid, size)
SELECT bt.user, ADDTIME(bt.date, bt.time), uv.urlid, bt.size
FROM bulk_table bt
INNER JOIN url_servers us
      ON us.server = bt.server
INNER JOIN url_paths up
      ON up.path=bt.path
INNER JOIN url_visit uv
      ON uv.url_server_ID = us.ID
      AND uv.url_path_ID = up.id
      AND uv.querystring = if(bt.path_split >0, SUBSTRING(bt.url,path),
NULL)


It may not be perfect but it's how I would do it. I am sure that others out
there would have a twist or two to put on the process.  When it's all over
and done, clear out the bulk_table and do it all over again.

I would not mix a statistic, like hit count, into the url_visit table. This
next query should be blazingly fast as both columns are part of a compound
index.

SELECT url_server_ID, url_path_ID, count(1)
from url_visit
group by url_server_ID,url_path_Id;

-OR- if you would rather see full names:

create temporary table tmpStats
SELECT url_server_ID, url_path_ID, count(1) as hits
from url_visit
group by url_server_ID,url_path_Id;

select concat(us.server,up.path), ts.hits
from tmpStats
inner join url_servers us
on us.id = ts.url_server_ID
inner join url_paths up
on up.id = ts.url_path_ID

drop table tmpStats;

(NOTE: I used a temp table to aggregate on the ID values before resolving
to the names because resolving those BEFORE running the GROUP BY would have
eliminated the use of an existing index.)

Well, I have to get back to work. Let me know how this works for you, OK?

Cheers,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine


------------all previous responses clipped for space ---------------------




-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: load data into 2 tables and set id

Reply via email to