Hi,

Since the 3.2 betas have started, I have been increasingly concerned
with indexing speed (htdig 3.1.5 indexed the small intranet site at my
workplace much faster: about 1 hour, whereas 3.2b2 takes many hours).
 
 So I started to think about inverted indexes, and I wondered how
computationally intensive it would be to take the most brutal
approach: one file per word, two integers per occurrence of the word
(document_id, position_in_document).
This is indeed brutal: indexing the LDP directory of my SUSE Linux
install (50MB of text, more than 40 issues of Linux Gazette) created
more than 70000 files, some of them more than 2MB long (e.g. the file
for 'a'). This is more brutal than the approach of the new mifluz
version (one file, and triplets (word_id, doc_id, position_in_document)),
which is a big improvement over htdig's way (and indeed that is what I
recommend for infrequent words).
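
To make the two layouts concrete, here is a small sketch of the record
formats with Python's struct module (the 4-byte field widths are the
ones dig.py below uses; the word_id value is made up for the example):

import struct

# Brutal layout: one file per word, so the word itself is implicit in
# the file name and each occurrence costs 8 bytes.
pair = struct.pack("II", 12, 345)           # doc_id 12, position 345
print len(pair)                             # -> 8

# mifluz-like layout (as I understand it): one single file, so the word
# must be carried as a word_id and each occurrence costs 12 bytes.
triplet = struct.pack("III", 9876, 12, 345) # 9876 is a made-up word_id
print len(triplet)                          # -> 12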

 Since changing ht://Dig is complicated (one needs to change the DB
format, htdig, htmerge and htsearch), I decided to write a simplistic
Python program to try my idea. I made the following choices:
  - I decided NOT to use a better approach for infrequent words (i.e.
put them in one file in the mifluz format), for the sake of simplicity.
  - I did not parse the HTML or use stop words, so my program indexes
ALL words and TAGS (i.e. 'a', 'href' and 'linux' are all indexed).
  - I put all the HTML files in one directory and did not mimic the
spider work of htdig, since that was not relevant to my test.

Then I did some benchmarks which I deem interesting despite the
following caveats:
  - Python is MUCH slower than C++ (my simplistic parser can barely
handle 6 Mbytes per minute!)
  - htdig has to connect to the local httpd to get its files
  - htdig is very configurable, which must cause some overhead that is
difficult to assess
  - testing takes time and I do this during my evenings
  - my simplistic program indexes many more occurrences than htdig (all
tags, all 1-letter and 2-letter words, and common English words too)
  - it is necessary to use reiserfs to improve speed (on a small 12MB
data set reiserfs makes the indexing process almost 7 times faster)

 Let's see the results with reiserfs on the 50MB data set (dual Celeron
466MHz, 128MB of RAM and a 20GB UDMA disk):
   - htdig : 206 minutes
   - dig.py 0.2 (with subdirs, reiserfs) : 35 minutes
   - dig.py 0.3 (no subdirs, reiserfs) : 62 minutes
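  In terms of raw throughput that is roughly 50MB/206min (about
0.24 MB/min) for htdig against 50MB/35min (about 1.4 MB/min) for
dig.py 0.2, close to a 6x speedup even though the parsing is done in
Python.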

  This shows that the brutal way could be usable even if you use it for
all words instead of keeping it for very frequent words such as 'linux'
or 'the'. But the difference between the two versions (0.2 being the
better one, since it puts word-files in subdirectories according to
their first character: 'linux' is in subdir 'l') shows that when you
use it for too many infrequent words it becomes less efficient, so one
would really want to detect locally rare words (which are very
numerous) and use a different technique for them.

 On a smaller test set I compared reiserfs and ext2 (both with 4K
blocks, which is not the best choice for dig.py); the results of the
subdir and no-subdir versions were similar:
  - reiserfs : 5min (the version with subdirs made less than a
2-second difference)
  - ext2 : 20min (or 23 without subdirs)
 Please note that if you don't write the results, the test takes 3min
(parsing and sorting). Subtracting those 3 minutes from both figures,
the reiserfs/ext2 ratio for the write phase is more like 2min/17min...
This filesystem really seems great.

I have not benchmarked searching, but I don't need to argue that going
through a sequential file to retrieve all occurrences of a given word
is faster than getting them from a Berkeley DB file (and it is easy to
code too; I wrote a simple phrase-searching program, but Python is
really slow for that kind of thing).
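
For the curious, here is a rough sketch of the idea (not my actual
program) for a phrase search over the per-word files written by
dig.py: read the (doc_id, position) pairs of each word of the phrase
and keep the documents where the positions follow each other.

import struct

def read_occurrences(word):
    # Return the list of (doc_id, position) pairs stored for a word,
    # assuming the ./db layout written by dig.py below.
    pairs = []
    word_file = open("./db/" + word[0] + "/" + word, "rb")
    while 1:
        record = word_file.read(8)
        if len(record) < 8:
            break
        pairs.append(struct.unpack("II", record))
    word_file.close()
    return pairs

def phrase_search(words):
    # Return the doc_ids where the words appear at consecutive positions.
    candidates = read_occurrences(words[0])
    offset = 0
    for word in words[1:]:
        offset = offset + 1
        next_set = {}
        for pair in read_occurrences(word):
            next_set[pair] = 1
        # Keep only the starting points whose successor is present.
        candidates = [ (d, p) for (d, p) in candidates
                       if next_set.has_key((d, p + offset)) ]
    return [ d for (d, p) in candidates ]

For example, phrase_search(["linux", "gazette"]) would list the
documents containing the phrase "linux gazette" (possibly several
times each).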

So here are my opinions:
 - handling occurrences as mifluz is said to do it (I have not checked
yet, so please don't flame me if I misunderstood the change-list
comments) would be a big improvement over htdig. It is the best
solution for rare words.
 - the brutal file-based approach can be used for frequent words, since
there are few of them. In fact, if you combine this approach with the
mifluz way (for locally infrequent words), you don't even need reiserfs
to get good performance (see the sketch after this list).
 - the current implementation in 3.2 is not usable at my workplace
because of indexing speed, so I hope htdig will at least do things the
mifluz way.
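
To make the combination concrete, here is the kind of dispatch I have
in mind (a sketch only: the FREQUENT_WORDS list, the word_id argument
and the triplet file are all made up for the illustration):

import struct

# In practice one would first detect which words are locally frequent.
FREQUENT_WORDS = {"the": 1, "a": 1, "linux": 1, "href": 1}

def store_occurrence(word, word_id, doc_id, position, triplet_file):
    # Frequent words get their own file (the brutal way); all the other
    # words share one mifluz-like file of (word_id, doc_id, position).
    if FREQUENT_WORDS.has_key(word):
        word_file = open("./db/" + word[0] + "/" + word, "ab")
        word_file.write(struct.pack("II", doc_id, position))
        word_file.close()
    else:
        triplet_file.write(struct.pack("III", word_id, doc_id, position))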

 I hope you found this message of some interest,

        Sincerely,

        Pierre-Olivier Gaillard
#!/usr/bin/python
import os
import os.path
import sys
import shlex
import re
import struct 
import string
import time

doc_id = 0
DB_DIR = "./db"
# Each occurrence is stored as two 4-byte unsigned ints: (doc_id, position).
OCCURRENCE_FORMAT = "II"

word_re = re.compile("([a-zA-Z0-9_�����]+)")

def get_file_name_for(word):
    # One file per word, sharded into subdirectories by first character.
    return DB_DIR + "/" + word[0] + "/" + word

def dump_occurrences_from_file(file_path):
    """Print every (doc_id, position) pair stored in a word file."""
    my_file = open(file_path, "rb")
    while 1:
        occur_string = my_file.read(8)
        if len(occur_string) == 8:
            print struct.unpack(OCCURRENCE_FORMAT, occur_string)
        else:
            my_file.close()
            return


def write_occurrences_for_word(word, occurrence_list):
    """Append the (doc_id, position) pairs of one word to its file."""
    try:
        output_file = open(get_file_name_for(word), "ab")
    except IOError:
        # The subdirectory for this first character does not exist yet.
        os.makedirs(os.path.dirname(get_file_name_for(word)))
        output_file = open(get_file_name_for(word), "ab")
    for occurrence in occurrence_list:
        # occurrence is (word, doc_id, position); only the last two are stored.
        output_file.write(struct.pack(OCCURRENCE_FORMAT,
                                      occurrence[1],
                                      occurrence[2]))
    output_file.close()



def write_occurrences(occur_list):
    """Write occurrences of words to the corresponding files
       in binary mode (format : docid (4Bytes) position (4Bytes)).
       occur_list must be sorted by word so that all occurrences of a
       given word are adjacent."""
    current_word = ""
    same_list = []
    for occurrence in occur_list:
        if current_word != occurrence[0]:
            # A new word starts: flush the occurrences of the previous one.
            if len(current_word) > 0:
                write_occurrences_for_word(current_word, same_list)
            current_word = occurrence[0]
            same_list = []
        same_list.append(occurrence)
    # Flush the last word.
    if len(current_word) > 0:
        write_occurrences_for_word(current_word, same_list)

    

def parse_and_index_one_file(file_path):
    global doc_id
    # Only index HTML files.
    if not re.search(r"\.html$", file_path):
        return
    doc_id = doc_id + 1
    print ("at %f" % time.time()) + " indexing " + file_path + (" as %d" % doc_id)
    input_file = open(file_path)
    position = 0
    occur_list = []
    # Every match of word_re is indexed, tags included ('a', 'href', ...).
    for word in word_re.findall(input_file.read()):
        position = position + 1
        occur_list.append((string.lower(word), doc_id, position))
    input_file.close()
    # Sort by word so that write_occurrences can group them per word file.
    occur_list.sort()
    write_occurrences(occur_list)
    

def parse_and_index_one_directory(arg, dirname, names):
    # Callback for os.path.walk: index every regular file in this directory.
    for file_name in names:
        file_path = dirname + "/" + file_name
        if os.path.isfile(file_path):
            parse_and_index_one_file(file_path)

# Main program: walk the directory given on the command line and index it.
os.path.walk(sys.argv[1], parse_and_index_one_directory, None)

# Sanity check: dump every stored occurrence of 'linux'.
dump_occurrences_from_file(get_file_name_for("linux"))
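
For reference, the script is run as "python dig.py <directory of HTML
files>"; the directory to index is the only command-line argument it
expects, and the word files end up under ./db.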

