Hi Gilles,

  I too am having issues performing merges with htmerge. My scenario is a 
little different in that I am creating a master database and then merging 2 
other databases into it.

  I have followed the threads between you both and have made some progress 
but am still having difficulties. Part of the issue, I think, is that I am 
using a newer snapshot, because I don't have the db.wordlist file at all. I 
am using version 3.2.0b4, and the files that are created are as follows:
        db.docdb
        db.docs.index
        db.excerpts
        db.words.db
        db.words.db_weakcmpr

  I build the master database with the -a option so the .work files are 
created - I have reworked my script to cp the .work files instead of mv them 
- however I am not sure which files are required by htmerge - do I need all 
the .work files for a successful merge?

  Furthermore - I am not clear on what actual commands I need to run - here 
is what I'm doing now - am I missing something??

        BUILD MASTER DATABASE
        rundig -vvv -a -c configfile > htdig.out
        cp -p $dbdir/db.docdb.work $dbdir/db.docdb
        cp -p $dbdir/db.docs.index.work $dbdir/db.docs.index
        cp -p $dbdir/db.excerpts.work $dbdir/db.excerpts
        cp -p $dbdir/db.words.db.work $dbdir/db.words.db
        cp -p $dbdir/db.words.db.work_weakcmpr $dbdir/db.words.db_weakcmpr

        BUILD 1ST MERGE DATABASE
        rundig -vvv -a -c configfile2 > htdig2.out
        mv $dbdir2/db.docdb.work $dbdir2/db.docdb
        mv $dbdir2/db.docs.index.work $dbdir2/db.docs.index
        mv $dbdir2/db.excerpts.work $dbdir2/db.excerpts
        mv $dbdir2/db.words.db.work $dbdir2/db.words.db
        mv $dbdir2/db.words.db.work_weakcmpr $dbdir2/db.words.db_weakcmpr

        DO THE MERGE INTO THE MASTER DATABASE
        htmerge -a -v -c configfile -m configfile2

        COPY THE .WORK FILES BACK TO THE MASTER DB FILES
        cp -p $dbdir/db.docdb.work $dbdir/db.docdb
        cp -p $dbdir/db.docs.index.work $dbdir/db.docs.index
        cp -p $dbdir/db.excerpts.work $dbdir/db.excerpts
        cp -p $dbdir/db.words.db.work $dbdir/db.words.db
        cp -p $dbdir/db.words.db.work_weakcmpr $dbdir/db.words.db_weakcmpr

        BUILD THE 2ND MERGE DATABASE
        rundig -vvv -a -c configfile3 > htdig3.out
        mv $dbdir3/db.docdb.work $dbdir3/db.docdb
        mv $dbdir3/db.docs.index.work $dbdir3/db.docs.index
        mv $dbdir3/db.excerpts.work $dbdir3/db.excerpts
        mv $dbdir3/db.words.db.work $dbdir3/db.words.db
        mv $dbdir3/db.words.db.work_weakcmpr $dbdir3/db.words.db_weakcmpr

        DO THE MERGE INTO THE MASTER DATABASE
        htmerge -a -v -c configfile -m configfile3

        COPY THE .WORK FILES BACK TO THE MASTER DB FILES
        cp -p $dbdir/db.docdb.work $dbdir/db.docdb
        cp -p $dbdir/db.docs.index.work $dbdir/db.docs.index
        cp -p $dbdir/db.excerpts.work $dbdir/db.excerpts
        cp -p $dbdir/db.words.db.work $dbdir/db.words.db
        cp -p $dbdir/db.words.db.work_weakcmpr $dbdir/db.words.db_weakcmpr
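
  The repeated cp blocks above could be collapsed into one helper. This is 
only a minimal sketch: the function name promote_work_files is made up for 
illustration, and the file names are simply the ones that 3.2.0b4 created in 
the listing earlier in this message. It does not settle which of these files 
htmerge actually requires.

```shell
# Hypothetical helper (not part of ht://Dig): copy each .work file in a
# database directory to its live, non-.work name, keeping the .work copy.
promote_work_files() {
    dir=$1
    for name in db.docdb db.docs.index db.excerpts db.words.db; do
        if [ -f "$dir/$name.work" ]; then
            cp -p "$dir/$name.work" "$dir/$name" || return 1
        else
            # Stop early so a partial copy is noticed.
            echo "missing: $dir/$name.work" >&2
            return 1
        fi
    done
    # The weakcmpr companion has .work in the middle of its name, so it is
    # handled separately; it is copied only if it exists.
    if [ -f "$dir/db.words.db.work_weakcmpr" ]; then
        cp -p "$dir/db.words.db.work_weakcmpr" "$dir/db.words.db_weakcmpr" || return 1
    fi
}
```

  With that in place, each "copy the .work files back" step would become a 
single call such as: promote_work_files "$dbdir"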

  I figured that this would result in a master database containing ALL the 
contents of the 2 smaller databases merged in. However, only the database for 
configfile3 seems to get merged successfully: when I do a search that should 
yield results for configfile2 I get 0 results - yet I get the appropriate 
results for configfile2 when I run a search.

  I thought using the "rundig" script may be causing issues because it does 
some other stuff to the database files, but it seems to merge at least some 
of the other databases into the master one.

  I'm quite confused - could you help clear things up for me? Thanks!

Cheers,
Jonathan Schlackl


On Wednesday 20 November 2002 09:04, you wrote:
> According to Dan Langille:
> > On 13 Nov 2002 at 21:48, Gilles Detillieux wrote:
> > > According to Dan Langille:
> > > > I have indexed a mailing list archive.  My next goal is to nightly
> > > > update that index by indexing the entire month's archive and then
> > > > merging that into the main database.  At present, there are about 4
> > > > years of data.  I'm seeking comments on my approach.
> > >
> > > You may also want to have a look at how we do it for the htdig-general
> > > and htdig-dev archives:
> > >
> > > http://www.htdig.org/files/contrib/scripts/README.geoupdate-ungeoify
> > > http://www.htdig.org/files/contrib/scripts/geoupdate.sh
> >
> > My results are interesting, and confusing.  The problem is
> > incomplete results.  I will explain as I go along.
> >
> > My base working directory is /usr/local/htdig.  The "production"
> > databases are:
> >
> > [dan@undef:/usr/local/htdig] $ ls -l databases/
> > total 214713
> > drwxr-xr-x  2 dan  dan       512 Nov 20 09:43 adsl-update
> > -rw-r--r--  1 dan  dan  70942720 Nov 10 11:17 db.docdb
> > -rw-r--r--  1 dan  dan   1940480 Nov 10 11:17 db.docs.index
> > -rw-r--r--  1 dan  dan  80104230 Nov 10 11:17 db.wordlist
> > -rw-r--r--  1 dan  dan  66720768 Nov 10 11:17 db.words.db
> >
> > I believe that the initial dig and merge are operating correctly.
> > The following files are created:
> >
> > [dan@undef:/usr/local/htdig] $ ls -l databases/adsl-update/
> > total 817
> > -rw-r--r--  1 dan  dan  221184 Nov 20 11:00 db.docdb
> > -rw-r--r--  1 dan  dan    9216 Nov 20 11:00 db.docs.index
> > -rw-r--r--  1 dan  dan  234216 Nov 20 11:00 db.wordlist
> > -rw-r--r--  1 dan  dan  344064 Nov 20 11:00 db.words.db
> >
> > It is the next merge which is the cause of the problem.  The command
> > is "/usr/local/bin/htmerge -a -c htdig-unixathome.org-adsl.conf -m
> > adsl-update.conf" issued from /usr/local/htdig.  This results in:
> >
> >    htmerge: Unable to open word list file
> >    '/usr/local/htdig/databases/db.wordlist.work'
>
> ...
>
> > Why is it expecting a .work file on input?
>
> The reason for the -a on the htmerge command is so that the main database
> can be updated (in the .work copy), while the actual main database
> is still available to and usable by htsearch, in case the merge takes
> a while.  Then, once the merge is completed, the resulting files can be
> moved and/or copied into place rather quickly.
>
> > I note that the geoupdate.sh script leaves behind only db.docdb.work
> > and db.wordlist.work.
>
> Yes, this is because these two files are all that htmerge needs.
> The other ones (db.words.db and db.docs.index) are generated by htmerge
> from the first two files.  But, because the script leaves these two
> files around, it also assumes the two files are there to begin with.
> So, the initial htdig and htmerge that created the main database should
> have been done with the -a option as well, or the .work files need to
> be created manually by copying their non-.work counterparts.
>
> > To supply the file, I do this in the databases subdirectory:
> > cp db.wordlist db.wordlist.work
> >
> > Running the htmerge then results in incomplete results.
> > Specifically, the db.docdb.work and db.docs.index.work are way too
> > small when compared to the original files
>
> ...
>
> That's because you should have copied db.docdb to db.docdb.work before
> running htmerge.  Without the existing db.docdb.work, it would seem that
> htmerge simply created one as a starting point, so obviously it's going
> to be missing a lot of records!
>
> > I also tried this approach
> >
> > cd /usr/local/htdig/databases
> > cp db.docdb db.docdb.work
> > cp db.docs.index db.docs.index.work
> > cp db.wordlist db.wordlist.work
> > cp db.words.db db.words.db.work
> >
> > Then I reran the merge to obtain a better result:
>
> ...
>
> > What am I not understanding?
>
> htmerge needs both the db.wordlist and db.docdb.  If you're using -a,
> then it needs the .work version of both of these files.  When you use
> htmerge -m, it merges the db.wordlist from the 2nd set into the main
> one, and merges the db.docdb records from the 2nd set into the first.
> After that, htmerge generates db.words.db from the words in db.wordlist
> (or the .work files of both of these with the -a option), and generates
> the db.docs.index from the db.docdb records (or the .work files).  So,
> if you're using -a, you should make sure htmerge has the db.docdb.work
> and db.wordlist.work files for the main database.  Also, because htsearch
> doesn't need the db.wordlist file, it can remain as a .work file.
> So your main databases directory should have the following files before
> and after the update script runs:
>
> db.wordlist.work      - needed by htdig -a and/or htmerge -a
> db.docdb.work         - needed by htdig -a and/or htmerge -a
> db.docdb              - needed by htsearch
> db.docs.index         - needed by htsearch
> db.words.db           - needed by htsearch
>
>
> For the first run of the script, if you have just db.docdb and db.wordlist
> to begin with, the only commands you'd need are...
>
> cd /usr/local/htdig/databases
> cp db.docdb db.docdb.work
> mv db.wordlist db.wordlist.work
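
  The two commands quoted above could be wrapped so that they are safe to run 
more than once. This is only a sketch for the 3.1.x layout described in the 
quoted message (db.docdb plus db.wordlist); the function name 
seed_work_files is hypothetical, and it does not apply as-is to 3.2.0b4, 
which has no db.wordlist file.

```shell
# Hypothetical helper (not part of ht://Dig): make sure db.docdb.work and
# db.wordlist.work exist in a database directory before "htmerge -a" runs.
seed_work_files() {
    dir=$1
    # htsearch still needs db.docdb, so it is copied rather than moved.
    if [ ! -f "$dir/db.docdb.work" ]; then
        cp -p "$dir/db.docdb" "$dir/db.docdb.work" || return 1
    fi
    # db.wordlist is not in the list of files htsearch needs, so it can
    # simply be renamed to its .work name.
    if [ ! -f "$dir/db.wordlist.work" ]; then
        mv "$dir/db.wordlist" "$dir/db.wordlist.work" || return 1
    fi
}
```

  For the directory used in the quoted message, the call would be: 
seed_work_files /usr/local/htdig/databases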


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
