JS,

I need one more piece of information to make that query work for you: the
structure of the table that you use (or would use) to bulk import those
logs.

If you are running out of room, you may consider further normalizing your
data (which saves space but creates more data-maintenance steps). I, too,
have had to deal with millions of rows of internet usage logs, so I
understand your pain.

You can store your IP addresses in an unsigned INT and get them back in
dotted notation with the MySQL functions INET_ATON() and INET_NTOA().
That will save you an average of 10 bytes PER ROW (it adds up when you
are into the millions of rows).
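
For example (untested; the ip_num column name is just a suggestion, and
the column must be INT UNSIGNED because addresses above 127.255.255.255
overflow a signed INT):

   ALTER TABLE internet_usage ADD ip_num INT UNSIGNED;
   UPDATE internet_usage SET ip_num = INET_ATON(ip);
   # once you are happy with the conversion:
   ALTER TABLE internet_usage DROP COLUMN ip;

   # and to get the dotted notation back out:
   SELECT uid, INET_NTOA(ip_num) AS ip FROM internet_usage LIMIT 5;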

Create a Proxy_Action table (id tinyint, action varchar(20)) and populate
it with all of the actions your proxy/firewall can log. Then replace the
column "action varchar(20)" with "ProxyAction_id tinyint". (I assume there
are fewer than 256 "action" messages available from your proxy?) That's 1
byte vs. 8 at the low end of your sample data (counting the length byte
stored with the string). First add the new column to the table, populate
it with the ID values from your new Proxy_Action table, then drop the old
column.
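
Untested, but the whole migration could look something like this (the
multi-table UPDATE syntax needs MySQL 4.0.4 or later):

   CREATE TABLE Proxy_Action (
       id tinyint unsigned not null auto_increment primary key,
       action varchar(20) not null,
       unique (action)
   );
   INSERT INTO Proxy_Action (action)
       SELECT DISTINCT action FROM internet_usage;

   ALTER TABLE internet_usage ADD ProxyAction_id tinyint unsigned;
   UPDATE internet_usage iu, Proxy_Action pa
       SET iu.ProxyAction_id = pa.id
       WHERE pa.action = iu.action;
   ALTER TABLE internet_usage DROP COLUMN action;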

URLs consist (in a basic sense) of the server portion (to the left of the
first single /), the path portion (to the right of the first / and to the
left of a ? or #), and either a fragment (after the #) or a query string
(after the ?). I would at least split the server portion out into its own
table. For each page request (assume 1 page and 9 pictures, all from the
same server), that is 10 rows of log data that all contain the same chunk
of similar information. Reducing that heavily repeated portion of your
data to an ID number will greatly reduce the size of your database.
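
A rough sketch of that split (untested; server_table and its column names
are placeholders, and SUBSTRING_INDEX(url, '/', 3) returns everything
before the third '/', i.e. the http://server.name part):

   CREATE TABLE server_table (
       serverid mediumint unsigned not null auto_increment primary key,
       server varchar(255) not null,
       unique (server)
   );
   INSERT IGNORE INTO server_table (server)
       SELECT DISTINCT SUBSTRING_INDEX(url, '/', 3) FROM url_table;

You would then add a serverid column to url_table (populated the same way
as ProxyAction_id above) and keep only the path portion in the url column.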

About the non-uniqueness of your internet_usage table: even if the same
user visits the same URL multiple times (is that what you mean by repeated
records?), the times should all be slightly different. If they are not
different, it is still possible that the same person requested the same
page twice or more during the same second (the auto-login feature of MSIE
comes to mind as one culprit). Or you could have multiple users all on the
same userID hitting the same page from different machines. I guess I would
have to see more data to understand your problem better.
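
In the meantime, a quick diagnostic like this would show what kind of
duplicates you actually have:

   SELECT uid, time, ip, action, urlid, size, COUNT(*) AS copies
   FROM internet_usage
   GROUP BY uid, time, ip, action, urlid, size
   HAVING copies > 1
   ORDER BY copies DESC
   LIMIT 20;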

Yours,
Shawn Green
Database Administrator
Unimin Corporation - Spruce Pine



                                                                                       
                                 
                      "J S"                                                            
                                 
                      <[EMAIL PROTECTED]        To:       [EMAIL PROTECTED]            
                                  
                      com>                     cc:       [EMAIL PROTECTED]             
                             
                                               Fax to:                                 
                                 
                      06/18/2004 09:40         Subject:  Re: load data into 2 tables 
and set id                         
                      AM                                                               
                                 
                                                                                       
                                 
                                                                                       
                                 




Shawn,

Thanks for helping on this. I really appreciate it.

>No problem!!
>
>Please post the structures of your "big_table" and your "url_table"
>(whatever you called them)  and I will help you to rewrite step 4 to count
>how many times a URL appears in the "big_table".

mysql> desc internet_usage;
+--------+-------------+------+-----+---------+-------+
| Field  | Type        | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| uid    | varchar(10) | YES  | MUL | NULL    |       |
| time   | datetime    | YES  |     | NULL    |       |
| ip     | varchar(20) | YES  |     | NULL    |       |
| action | varchar(20) | YES  |     | NULL    |       |
| urlid  | int(11)     | YES  |     | NULL    |       |
| size   | int(11)     | YES  |     | NULL    |       |
+--------+-------------+------+-----+---------+-------+
6 rows in set (0.03 sec)

mysql> desc url_table;
+----------+--------------+------+-----+---------+----------------+
| Field    | Type         | Null | Key | Default | Extra          |
+----------+--------------+------+-----+---------+----------------+
| urlid    | mediumint(9) |      | PRI | NULL    | auto_increment |
| url      | text         | YES  | MUL | NULL    |                |
| hits     | mediumint(9) | YES  |     | NULL    |                |
| category | varchar(50)  | YES  |     | NULL    |                |
+----------+--------------+------+-----+---------+----------------+
4 rows in set (0.01 sec)

mysql>

>
>Is this bulk import a process you need to do repeatedly? If so, we need to
>worry about updating the count column on the next batch import and not
>just re-creating it from the new data. That will change the query
>significantly.

Yes, it will be a repeated process. Actually I have a backlog of 6 months'
data to load!
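
I guess the per-batch update of hits would have to be something like the
following? (Completely untested, and assuming each new batch still sits
in the BIG_TABLE staging table from your earlier steps.)

   CREATE TEMPORARY TABLE batch_hits
       SELECT u.urlid, COUNT(*) AS cnt
       FROM BIG_TABLE bt
       INNER JOIN url_table u ON u.url = bt.url
       GROUP BY u.urlid;

   UPDATE url_table u, batch_hits b
       SET u.hits = IFNULL(u.hits, 0) + b.cnt
       WHERE u.urlid = b.urlid;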

Here's an example of what the data looks like:

mysql> select * from internet_usage limit 5;
+---------+---------------------+----------------+--------------+-------+------+
| uid     | time                | ip             | action       | urlid | size |
+---------+---------------------+----------------+--------------+-------+------+
| n58396  | 2004-06-07 21:12:16 | 21.38.25.204   | TCP_TUNNELED |  5999 | 5297 |
| u344584 | 2004-06-07 21:07:12 | 21.33.136.74   | TCP_HIT      |  4494 |  438 |
| -       | 2004-06-07 21:07:02 | 21.38.92.76    | TCP_NC_MISS  |  2134 |  771 |
| u524797 | 2004-06-07 21:03:27 | 21.32.25.41    | TCP_NC_MISS  |   260 |  582 |
| -       | 2004-06-07 21:09:13 | 21.201.130.240 | TCP_HIT      |  3112 |  542 |
+---------+---------------------+----------------+--------------+-------+------+


mysql> select * from url_table limit 5;
+-------+-----------------------------------------------------------------------+------+---------------+
| urlid | url                                                                   | hits | category      |
+-------+-----------------------------------------------------------------------+------+---------------+
|     1 | http://www.bbc.co.uk/horoscopes/chinesehoroscopes/images/hp-snake.gif | NULL | Entertainment |
|     2 | http://www.call18866.co.uk/images/logo.jpg                            | NULL | none          |
|     3 | http://m2.doubleclick.net/866421/0409_santoku_250.gif                 | NULL | none          |
|     4 | http://lysto1-dc02.ww.ad.ba.com/                                      | NULL | Travel        |
|     5 | http://www.aboutscotland.com/edin/newstreetph/sitview.jpg             | NULL | Travel        |
+-------+-----------------------------------------------------------------------+------+---------------+

5 rows in set (0.00 sec)


One other problem I'm having here is making the rows in internet_usage
unique. At the moment I have lots of duplicates, and I was trying to
create a temporary table but unfortunately got an error 27 (I think this
refers to a 2GB limit).

mysql> CREATE TEMPORARY TABLE bad_temp AS SELECT DISTINCT * FROM
internet_usage;
ERROR 1030: Got error 27 from table handler

Is there another way of doing this?
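
Would something like ALTER IGNORE TABLE work here instead? (Completely
untested on my side; with IGNORE, rows that would violate the new unique
key get silently dropped, which is the deduplication, although I gather a
UNIQUE key will not collapse rows where any of the indexed columns are
NULL.)

   ALTER IGNORE TABLE internet_usage
       ADD UNIQUE dedup (uid, time, ip, action, urlid, size);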

>
>Yours,
>Shawn Green
>Database Administrator
>Unimin Corporation - Spruce Pine
>
> >
> >Mos forgot to populate the url_id column in your user table. I would use
> >his same process but re-arrange it like this:
> >
> >1) create table BIG_TABLE....
> >2) load data infile....
> >3) create table URL_TABLE (
> >       url_id bigint not null auto_increment,
> >       url varchar(255) not null primary key,
> >       category ....
> >       )
> >4) INSERT IGNORE URL_TABLE (url, category)
> >       SELECT url,category
> >       FROM BIG_TABLE
> >5) create table USER_TABLE (
> >       user_id varchar?(...) primary key,
> >       date_time datetime,
> >       url_id bigint,
> >       size int
> >       )
> >
> >6) insert USER_TABLE
> >       SELECT bt.userID, bt.datetime, u.url_id, bt.size
> >       FROM BIG_TABLE bt
> >       INNER JOIN URL_TABLE u
> >       ON u.url = bt.url
> >
> >doing it this way lets the SQL engine handle the conversion of URLs to
> >their IDs in the USER_TABLE...see? No scripting required at all!
> >
> >Yours,
> >Shawn Green
> >Database Administrator
> >Unimin Corporation - Spruce Pine
> >
> >
>
>Shawn,
>
>Many thanks for your reply. It sounds like this might be quicker than my
>perl script, which parses a 1GB log file and fills the database in 40 mins
>at the mo. (The actual parsing only takes 1.5 mins.)
>There's one snag which I forgot about, and that's that in the url_table I
>have another column called hits, which is the number of hits for each url.
>I'm terrible at SQL and wondered if you might be able to suggest a way of
>doing this with the method above?
>
>Thanks,
>
>JS.
>
