J S,

Check to see if the url_scheme_ID part of a unique constraint/key or the
Primary Key? If it isn't, you will get dupes even if you use INSERT IGNORE
unless you pre-screen your INSERTs some other way to avoid duplication.

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine





                                                                                       
                                
                      "J S"                                                            
                                
                      <[EMAIL PROTECTED]        To:       [EMAIL PROTECTED]            
                                 
                      com>                     cc:       [EMAIL PROTECTED]             
                            
                                               Fax to:                                 
                                
                      06/23/2004 04:13         Subject:  Re: load data into 2 tables 
and set id                        
                      AM                                                               
                                
                                                                                       
                                
                                                                                       
                                




Shawn,

I uncovered a problem this morning. I wonder if you (or anyone else) can
help me out again?

mysql> select * from url_visit where url_scheme_ID=3 limit 10;
+---------+---------------+---------------+-------------+-------+----------+

| urlid   | url_scheme_ID | url_server_ID | url_path_ID | query | category
|
+---------+---------------+---------------+-------------+-------+----------+

|   23392 |             3 |          1070 |       22221 | NULL  | none
|
| 1346269 |             3 |          1070 |       22221 | NULL  | none
|
+---------+---------------+---------------+-------------+-------+----------+

2 rows in set (0.00 sec)

This is the insert statement I'm using:

INSERT IGNORE url_visit (url_scheme_ID, url_server_ID, url_path_ID, query,
category)
                         SELECT DISTINCT uc.ID, us.ID,
up.ID,bt.query,bt.category
                         FROM bulk_table bt
                         INNER JOIN url_servers us ON us.server=bt.server
                         INNER JOIN url_paths up ON up.path=bt.path
                         INNER JOIN url_schemes uc ON uc.scheme=bt.scheme

Do I  need brackets after the distinct? e.g
SELECT DISTINCT (uc.ID, us.ID, up.ID,bt.query,bt.category)

Thanks,

js.

>
>
>Great catch! I believe you found the problem. Sorry about that!   ;-D
>
>So... just curious... from the original 60GB of text data, how much space
>did the final data require (including indices)?
>
>Yours,
>Shawn Green
>Database Administrator
>Unimin Corporation - Spruce Pine
>
>
>
>
>                       "J S"
>                       <[EMAIL PROTECTED]        To:
[EMAIL PROTECTED]
>                       com>                     cc:
>[EMAIL PROTECTED]
>                                                Fax to:
>                       06/22/2004 07:55         Subject:  Re: load data
>into 2 tables and set id
>                       AM
>
>
>
>
>
>
>I think I fixed it!
>
>INSERT IGNORE url_visit (url_server_ID, url_path_ID, querystring,
>category)
>SELECT DISTINCT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path),
>NULL),
>bt.category
>FROM bulk_table bt
>INNER JOIN url_servers us
>        ON us.server = bt.server
>INNER JOIN url_paths up
>        on up.path=bt.path
>
>After this I ran the sql to insert into internet_usage and this time it
>finished after 10m32.
>
>I'll do some more testing and let you know how it goes.
>
>Thanks,
>
>js.
> >
> >Did you mean there to be duplicates in the url_visits? Do I need to use
> >IGNORE in the following SQL?
> >
> >INSERT url_visit (url_server_ID, url_path_ID, querystring,
> >category)
> >SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path),
> >NULL),
> >bt.category
> >FROM bulk_table bt
> >INNER JOIN url_servers us
> >       ON us.server = bt.server
> >INNER JOIN url_paths up
> >       on up.path=bt.path
> >
> >js.
> >
> >mysql> select * from url_visit where urlid=1631;
>
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+

>
> >| urlid | url_server_ID | url_path_ID | query
>
> >
>
> >     | category     |
>
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+

>
> >|  1631 |         21720 |      630695 |
>
>cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift

>
> >| Online Sales |
>
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+

>
> >1 row in set (0.01 sec)
> >
> >mysql> select * from url_visit where urlid=1753;
>
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+

>
> >| urlid | url_server_ID | url_path_ID | query
>
> >
>
> >     | category     |
>
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+

>
> >|  1753 |         21720 |      630695 |
>
>cid=mrkbaki&src=catcmrkdswi|p1|rshop/catcmrkseai|p1|rshop/catcmrkhori|p1|rshop/catcmrki|p1|rshop/catcripa|p1|rgift

>
> >| Online Sales |
>
>+-------+---------------+-------------+--------------------------------------------------------------------------------------------------------------------+--------------+

>
> >1 row in set (0.00 sec)
> >
> >>
> >>Shawn,
> >>
> >>Thanks for your reply below. I found it extremely useful. I have
>followed
>
> >>your instructions and got good results up to the url_visits table.
> >>
> >>I have a perl script to parse the values out of the log. The log has
> >>3,770,246 lines and is gzipped. I then applied your SQL statements with
> >>the following times:
> >>
> >>1. Parse script: 10m31
> >>2. Load infile: 2m39
> >>3. Insert url_servers and url_paths: 5m38
> >>4. Insert url_visits 9m50
> >>
> >>However when I ran the sql to insert the internet usage table, it just
>ran
> >>until all the log space was filled up:
> >>
> >>-rw-rw----   1 mysql    mysql    379694680 Jun 22 10:05 bulk_table.MYD
>
> >>        <====
> >>-rw-rw----   1 mysql    mysql       1024 Jun 22 10:05 bulk_table.MYI
> >>-rw-rw----   1 mysql    mysql       8760 Jun 22 09:59 bulk_table.frm
> >>-rw-rw----   1 mysql    mysql    2114977792 Jun 22 11:11
> >>internet_usage.MYD     <====
> >>-rw-rw----   1 mysql    mysql       1024 Jun 22 10:34
internet_usage.MYI
> >>-rw-rw----   1 mysql    mysql       8646 Jun 22 09:59
internet_usage.frm
> >>-rw-rw----   1 mysql    mysql    33437600 Jun 22 10:23 url_paths.MYD
> >>-rw-rw----   1 mysql    mysql    27696128 Jun 22 10:23 url_paths.MYI
> >>-rw-rw----   1 mysql    mysql       8574 Jun 22 09:59 url_paths.frm
> >>-rw-rw----   1 mysql    mysql     646160 Jun 22 10:19 url_servers.MYD
> >>-rw-rw----   1 mysql    mysql     796672 Jun 22 10:19 url_servers.MYI
> >>-rw-rw----   1 mysql    mysql       8578 Jun 22 09:59 url_servers.frm
> >>-rw-rw----   1 mysql    mysql    119076844 Jun 22 10:32 url_visit.MYD
> >>-rw-rw----   1 mysql    mysql    73026560 Jun 22 10:33 url_visit.MYI
> >>-rw-rw----   1 mysql    mysql       8694 Jun 22 09:59 url_visit.frm
> >>
> >>I'm sure something's not right because the internet Usage table is
>bigger
>
> >>than the bulk table. I changed your sql a bit (on the last line with
the
> >>query). Could this be the problem?
> >>
> >>INSERT internet_usage (uid,time,ip,urlid,size)
> >>SELECT bt.user, bt.time, bt.ip, uv.urlid, bt.size
> >>FROM bulk_table bt
> >>INNER JOIN url_servers us
> >>        ON us.server=bt.server
> >>INNER JOIN url_paths up
> >>        ON up.path=bt.path
> >>INNER JOIN url_visit uv
> >>        ON uv.url_server_ID=us.ID
> >>        AND uv.url_path_ID=up.ID
> >>        AND uv.query=bt.query;
> >>
> >>Thanks again,
> >>
> >>js.
> >>
> >>>
> >>>-------- welcome to a basic overview of bulk importing and normalizing
>as
> >>>you go ------------
> >>>
> >>>[ author's note: if you are seeing this thread for the first time and
> >>>certain items seem to be
> >>>introduced out of context, please review all previous posts in this
> >>>thread.
> >>>There has been
> >>>a lot of information already exchanged on this topic that I clipped
> >>>out of this response. Thanks! -- SG]
> >>>
> >>>In an earlier post you said
> >>> >The data values I have for each record are:
> >>> >
> >>> >user_id   date_time   size   url   category
> >>>
> >>> >for example:
> >>> >
> >>> >u752359   2004-04-02 12:33:04   3403
>http://www.mysql.com/index.html
> >>>business
> >>> >
> >>>
> >>>To me that implies that you have a standard text log where each field
>is
> >>>separated by a space and rows are delimited by a CRLF pair. To get
that
> >>>log
> >>>into MySQL you will need to do something _like_ (I say "like" as you
>will
> >>>most likely need to tweak it to accommodate your actual data)
> >>>
> >>>CREATE TABLE bulk_table (
> >>>       user varchar(10)
> >>>       , eventdate date
> >>>       , eventtime time
> >>>       , size int
> >>>       , url text
> >>>       , category varchar(50)
> >>>);
> >>>
> >>>LOAD DATA INFILE <a proxy log file> INTO bulk_table FIELDS TERMINATED
>BY
>
> >>>'
> >>>' LINES TERMINATED BY '\r\n';
> >>>
> >>>Now we can start merging the logs into your data tables.
> >>>
> >>>To quote RFC 2396:
> >>>    The URI syntax does not require that the scheme-specific-part have
> >>>    any general structure or set of semantics which is common among
all
> >>>    URI.  However, a subset of URI do share a common syntax for
> >>>    representing hierarchical relationships within the namespace.
This
> >>>    "generic URI" syntax consists of a sequence of four main
>components:
> >>>
> >>>       <scheme>://<authority><path>?<query>
> >>>
> >>>    each of which, except <scheme>, may be absent from a particular
>URI.
> >>>
> >>>That translates into 4 logical pieces we can split a URL into:
> >>>1) the scheme -- HTTP, FTP, GOPHER, etc...
> >>>2) the authority -- that is the server (www.yahoo.com) and any port
> >>>numbers
> >>>or login information
> >>>3) the path -- /somefolder/somefile.whatever
> >>>4) the query -- everything after the ?
> >>>
> >>>Not part of the generic URI is that bit known as a fragment (as
> >>>identified
> >>>by the #) it is ALWAYS at the end of the _entire URL_ (including the
> >>>query)
> >>>when it's used. I have always lumped those and queries into the same
> >>>field.
> >>>
> >>>So you could create tables for each of those parts and get VERY
> >>>normalized
> >>>or you can partly normalize like this:
> >>>
> >>>ALTER TABLE bulk_table add server_split int not null default 0
> >>>       , add path_split int not null default 0
> >>>       , add server varchar(255)
> >>>       , add path varchar(255)
> >>>
> >>>UPDATE bulk_table
> >>>set server_split = LOCATE('/', URL , 8)-1;
> >>>
> >>>UPDATE bulk_table
> >>>SET path_split = if(LOCATE('?', URL, server_split)> 0, LOCATE('?',
URL,
> >>>server_split),  LOCATE('#', URL, server_split)-1);
> >>>
> >>>UPDATE bulk_table
> >>>set server=LEFT(URL, server_split )
> >>>
> >>>#those 4 new columns helped us to parse out the 3 major parts of the
>url
> >>>#I added them to the table so that we would not have to keep
> >>>recalculating
> >>>those values later on
> >>># if it turns out that adding the columns takes a LONG time, we can
> >>>create
> >>>this table with those columns
> >>># already created and just not import to them (change the LOAD DATA
> >>>INFILE
> >>>statement slightly)
> >>>
> >>>CREATE TABLE IF NOT EXISTS url_servers (
> >>>ID int not null auto_increment,
> >>>server varchar(255) primary key,
> >>>Key (ID)
> >>>)
> >>>
> >>>CREATE TABLE IF NOT EXISTS url_paths (
> >>>ID int not null auto_increment,
> >>>path varchar(255) primary key,
> >>>Key (ID)
> >>>)
> >>>
> >>>INSERT IGNORE INTO url_servers(server)
> >>>SELECT DISTINCT server
> >>>FROM bulk_table
> >>>
> >>>INSERT IGNORE INTO url_paths (path)
> >>>SELECT DISTINCT path
> >>>FROM bulk_table
> >>>
> >>># at this point we have all of our new Servers and our Paths uniquely
> >>>numbered
> >>># but we are going to need a slightly different URL table to track
> >>>visits.
> >>>
> >>>CREATE TABLE url_visit (
> >>>urlid mediumint not null auto_increment primary key,
> >>>url_server_ID int not null default 0,
> >>>url_path_ID int not null default 0,
> >>>querystring text default null,
> >>>category varchar(50)default null,
> >>>KEY(url_server_ID, url_path_ID)
> >>>)
> >>>
> >>>## that last key is to speed up our joins to our _servers and _paths
> >>>tables...
> >>>
> >>>#  we finally have enough information to insert to the visit table
> >>>
> >>>INSERT url_visit (url_server_ID, url_path_ID, querystring, category)
> >>>SELECT us.ID, up.ID, if(bt.path_split >0, SUBSTRING(bt.url,path),
>NULL),
> >>>bt.category
> >>>FROM bulk_table bt
> >>>INNER JOIN url_servers us
> >>>       ON us.server = bt.server
> >>>INNER JOIN url_paths up
> >>>       on up.path=bt.path
> >>>
> >>>## (see where the new pre-computed columns come in handy?) :-D
> >>>
> >>># and now we have enough information to load the internet_usage table.
> >>>Though there isn't enough data in your sample
> >>># source data to fill in all of the columns
> >>>
> >>>INSERT internet_usage ( uid,`time`,urlid, size)
> >>>SELECT bt.user, ADDTIME(bt.date, bt.time), uv.urlid, bt.size
> >>>FROM bulk_table bt
> >>>INNER JOIN url_servers us
> >>>       ON us.server = bt.server
> >>>INNER JOIN url_paths up
> >>>       ON up.path=bt.path
> >>>INNER JOIN url_visit uv
> >>>       ON uv.url_server_ID = us.ID
> >>>       AND uv.url_path_ID = up.id
> >>>       AND uv.querystring = if(bt.path_split >0,
>SUBSTRING(bt.url,path),
> >>>NULL)
> >>>
> >>>
> >>>It may not be perfect but it's how I would do it. I am sure that
others
> >>>out
> >>>there would have a twist or two to put on the process.  When it's all
> >>>over
> >>>and done, clear out the bulk_table and do it all over again.
> >>>
> >>>I would not mix a statistic, like hit count, into the url_visit table.
> >>>This
> >>>next query should be blazingly fast as both columns are part of a
> >>>compound
> >>>index.
> >>>
> >>>SELECT url_server_ID, url_path_ID, count(1)
> >>>from url_visit
> >>>group by url_server_ID,url_path_Id;
> >>>
> >>>-OR- if you would rather see full names:
> >>>
> >>>create temporary table tmpStats
> >>>SELECT url_server_ID, url_path_ID, count(1) as hits
> >>>from url_visit
> >>>group by url_server_ID,url_path_Id;
> >>>
> >>>select concat(us.server,up.path), ts.hits
> >>>from tmpStats
> >>>inner join url_servers us
> >>>on us.id = ts.url_server_ID
> >>>inner join url_paths up
> >>>on up.id = ts.url_path_ID
> >>>
> >>>drop table tmpStats;
> >>>
> >>>(NOTE: I used a temp table to aggregate on the ID values before
>resolving
> >>>to the names because resolving those BEFORE running the GROUP BY would
> >>>have
> >>>eliminated the use of an existing index.)
> >>>
> >>>Well, I have to get back to work. Let me know how this works for you,
>OK?
> >>>
> >>>Cheers,
> >>>Shawn Green
> >>>Database Administrator
> >>>Unimin Corporation - Spruce Pine
> >>>
> >>>
> >>>------------all previous responses clipped for space
> >>>---------------------
> >>>
> >>>
>
>
>
>
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]
>

_________________________________________________________________
Want to block unwanted pop-ups? Download the free MSN Toolbar now!
http://toolbar.msn.co.uk/







-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Reply via email to