Hi

> I would like to start by thanking you for the work you did on the Mysql
> patch for htdig-3.1.5.  I am using it in an htdig setup that I am working
> on and using Mysql as the back-end is great!
> 
Nice to hear that :)

> My next step in my project is that I want to be able to run multiple
> copies of htdig concurrently, all indexing into the same databases.
> 
> One idea for doing that was having each instance write words into its own
> TMPLINKS table -- TMPLINKS01, TMPLINKS02, TMPLINKS03, etc.
> 
> The other idea that I had was to remove the TMPLINKS table altogether and
> write words directly into the LINKSxx tables.  I found htlib/SQL_db.cc to
> have code, commented out, that inserted directly into the LINKSxx tables.  
> I was hoping that you could shed some light on why you chose to create the
> TMPLINKS table and what the negative repercussions are, if there are any,
> to inserting directly into the LINKSxx tables.
> 
> Thanks in advance for any help you can provide.
> 
I'll try to explain below how all these LINKS, TMPLINKS and
LINKS00-LINKS99 tables fit together. It all started when I used htdig
with MySQL to index big web sites, resulting in databases larger than
1 GByte :)

In the first implementation I made, there was only one LINKS table and
no TMPLINKS. The LINKS table makes the connection between a document
URL (DOC_ID) and a word (WORD_ID), so it is used by both htdig and
htsearch, and it is indexed. Each time you add a record to it, the
index is updated. The problem is that this table has a huge number of
records, and inserting word by word gives very poor performance,
since every insert means more indexing work for the MySQL server.
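
For reference, a minimal sketch of what such a LINKS table could look
like (DOC_ID and WORD_ID come from the description above; the types
and exact layout are my assumption, not copied from the patch):

    CREATE TABLE LINKS (
        DOC_ID  INT NOT NULL,   -- document the word occurs in
        WORD_ID INT NOT NULL,   -- id of the word
        INDEX (WORD_ID)         -- makes htsearch lookups fast, but has
                                -- to be maintained on every insert
    );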

There is an option of "locking" a table. While the table is locked,
the records added are not indexed; the indexing is performed after
unlocking the table. This gives much better performance. The problem
is that during a lock session there is no possibility of searching
(using htsearch) :(.
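
The pattern is roughly the following (a sketch of the MySQL
statements involved, with made-up values; the real calls live in
htlib/SQL_db.cc):

    LOCK TABLES LINKS WRITE;
    INSERT INTO LINKS VALUES (1, 100);
    INSERT INTO LINKS VALUES (1, 101);
    -- ... many more inserts; index maintenance is deferred ...
    UNLOCK TABLES;  -- index is brought up to date here

Note that while the WRITE lock is held, other connections (htsearch
included) cannot read the table, which is exactly the problem above.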

So my idea was to use a temporary TMPLINKS table combined with the
lock/unlock facilities of MySQL. You will see that this table
(TMPLINKS) is identical to LINKS, except that it has no indexes.
Another optimization comes from splitting the LINKS table into the
LINKS00..LINKS99 tables. The records having:
  WORD_ID = 0, 100, 200, etc. go to LINKS00,
  WORD_ID = 1, 101, 201, etc. go to LINKS01,
and so on. It is a simple hashing based on the WORD_ID field. This
reduces the size of each LINKS table, but increases the number of
tables used by the SQL server.
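
In other words, WORD_ID modulo 100 picks the bucket, so a lookup only
ever touches one of the small tables. For example (assuming the
two-digit, zero-padded naming above):

    -- WORD_ID = 137  ->  137 % 100 = 37  ->  table LINKS37
    SELECT DOC_ID FROM LINKS37 WHERE WORD_ID = 137;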

The SQL_db::AddLink() function writes the records into the TMPLINKS
table. The function SQL_db::CopyTmpLinks() just copies all the records
from TMPLINKS into the appropriate LINKSxx tables. In the current
implementation, this function is called after parsing each document,
see WordList::Flush(). However, if the value of the SQL_tmplinks
variable in the htdig.conf file is different from "flush", then the
records remain in the TMPLINKS table and are not copied to the LINKSxx
tables (the CopyTmpLinks() function does nothing). One possibility
here is to copy them after htdig has finished indexing, or perhaps
once a day. The main problem here is that the copying and re-indexing
of the tables slow down htsearch's accesses to the database.
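
Conceptually, CopyTmpLinks() boils down to one statement per bucket,
something like this (a sketch assuming the column layout from above;
the real code is in SQL_db.cc):

    INSERT INTO LINKS37
        SELECT DOC_ID, WORD_ID FROM TMPLINKS WHERE WORD_ID % 100 = 37;
    -- ... one such statement for each of the 100 buckets, then:
    DELETE FROM TMPLINKS;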

The htsearch program uses only the LINKS00-99 tables, and does not
use TMPLINKS at all. The LINKS table is no longer used by any of the
programs (I used it in the early versions, and it just remained
there :).

As you can see, all these tricks with the *LINKS* tables are
justified by the speed, especially when dealing with large volumes of
data. I personally used htdig/SQL with an SQL database of about
1.5 GBytes :) and it worked quite nicely.

Now, returning to the problem of running multiple copies of htdig,
each indexing into the same database. If you set SQL_tmplinks to
"noflush", then the concurrent digging should work with no problems.
The only thing you will have to do is copy the records from TMPLINKS
into the LINKSxx tables afterwards. This can be done with a small
Perl script or C++ program (calling the appropriate SQL_db::
functions), as sketched below. One further optimization might be
_not_ to take the "TMPLINKS write" lock in the WordList::Flush()
function, since nobody deletes records from this table and the table
is not indexed at all. This would allow concurrent write access to
the table. Otherwise one instance of htdig could block, waiting for
another instance to finish writing to it.
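
Such a script would essentially repeat the per-bucket INSERT ...
SELECT from above, e.g. for bucket 37 (a sketch; locking both tables
keeps htsearch from seeing a half-finished copy):

    LOCK TABLES TMPLINKS READ, LINKS37 WRITE;
    INSERT INTO LINKS37
        SELECT DOC_ID, WORD_ID FROM TMPLINKS WHERE WORD_ID % 100 = 37;
    UNLOCK TABLES;
    -- ... repeat for the other buckets, then empty the staging table:
    DELETE FROM TMPLINKS;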

Another scenario is if you need to search using htsearch while
indexing documents with multiple instances of htdig. In that case you
will need to set SQL_tmplinks to "flush". Each of the htdig instances
will write records into the same TMPLINKS table, but then they will
need to use this table exclusively! The locking sequence in the
WordList::Flush() function is currently:

    dbf->DelTmpLinks ();
    dbf->Lock (sTable);

    .. write links .. (AddLink())
 
    dbf->UnLock ();
    dbf->CopyTmpLinks ();

Unfortunately, this could cause problems when multiple instances are
running against the same database. The solution should be something
like:

    dbf->Lock (sTable);
    dbf->DelTmpLinks ();

    .. write links .. (AddLink())
 
    dbf->CopyTmpLinks ();
    dbf->UnLock ();

This would make sure that while one instance is using TMPLINKS, nobody
else is using it.

There should be a discussion on whether TMPLINKS is needed at all in
this case. The advantage of using it is that, since the LINKSxx
tables are indexed, only one re-indexing per table is necessary when
copying from TMPLINKS (we copy all records into one LINKSxx with a
single INSERT ... SELECT statement). If we write each record directly
into LINKSxx, then every single write triggers an index update. One
way of getting rid of this is to sort the records in memory, lock one
of the LINKSxx tables, insert all of its records, then unlock it.
This would avoid using the TMPLINKS table, and I think it could give
even better performance.
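
As a sketch, that per-bucket sequence would look something like this
(assuming the records for bucket 37 have already been grouped in
memory by htdig):

    LOCK TABLES LINKS37 WRITE;
    INSERT INTO LINKS37 VALUES (1, 137);
    INSERT INTO LINKS37 VALUES (2, 237);
    -- ... all buffered records for this bucket ...
    UNLOCK TABLES;  -- the index is updated once, here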

I hope this (little :) description of how the SQL implementation works
will help you. 


Cheers,
--zoran

http://www.idi.ntnu.no/~zoran


