Thanks Jim. I have started merging files as you instructed.

For URL indexing, I found that there are lots of URLs which are from sites other than ours. In your explanation below, it is the first case you mentioned. I have used 'start_url', but it is still spidering other sites' URLs. I have now set the following parameter:

common_url_parts: http://www.example.com/

I am merging all the files now. There are 20 sites I need to merge.

While I was creating the individual databases, I found that one huge file named "core" was being created. What is this file for? I have deleted it; it seems to be a binary.

It would be really great if you could clear up my confusion.

Best regards
-Tinni
Jim <[EMAIL PROTECTED]> wrote:
On Thu, 8 Apr 2004, Tinni wrote:
> - We have almost 99 sites, and I want all 99 sites to be merged into a
> single database. I want each site's database to be created separately,
> to spread out the server load/space, and then finally merge all sites
> into a single database.
> Could you please give me an idea of how to run 'rundig' and 'htmerge'
> for the above requirement?
You first need to create separate configuration files for each set of
databases. It is probably easiest to start with copies of the default
configuration file and make edits as necessary. At a minimum you should
probably take a look at the following attributes.
database_base
database_dir
start_url
limit_urls_to
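As a minimal sketch, one per-site configuration file might look like the
following. The file name and directory path are assumptions for
illustration only, not values from this thread; adjust them to your
installation.

```
# site1.conf (hypothetical name), copied from the default htdig.conf
database_dir:   /var/lib/htdig/site1
database_base:  ${database_dir}/db
start_url:      http://www.example.org/
limit_urls_to:  ${start_url}
```

Giving each site its own database_dir keeps the database sets from
overwriting each other.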
After creating the appropriate configuration files, you can build each
database set by using the standard rundig script and passing the
corresponding configuration file with the '-c' option. In order to merge
everything together, you need to call htmerge repeatedly with the '-m'
option. The merge step is explained in the documentation for htmerge.
See http://www.htdig.org/htmerge.html
I would suggest that you start with two or three sets of databases and
work with that until you are comfortable with the process and verify
that you have worked out any kinks that you might run into.
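A rough sketch of the build-then-merge sequence for three sites is shown
here. The directory /etc/htdig, the names site1 to site3, and main.conf
(the configuration for the merged database) are all hypothetical, and
whether the merge target has to exist beforehand is something to confirm
in the htmerge documentation. By default the script only prints the
commands it would run.

```shell
#!/bin/sh
# Hypothetical layout: one config file per site in $CONF_DIR, plus
# main.conf describing the merged database. None of these names come
# from the thread; change them to match your setup.
CONF_DIR=${CONF_DIR:-/etc/htdig}
SITES=${SITES:-"site1 site2 site3"}
# Dry run by default: commands are echoed, not executed.
# Run with RUN= (empty) to actually invoke rundig and htmerge.
RUN=${RUN-echo}

build_and_merge() {
    # Step 1: build each site's database set from its own config file.
    for site in $SITES; do
        $RUN rundig -c "$CONF_DIR/$site.conf"
    done
    # Step 2: merge each site's databases into the one named by main.conf.
    for site in $SITES; do
        $RUN htmerge -c "$CONF_DIR/main.conf" -m "$CONF_DIR/$site.conf"
    done
}

build_and_merge
```

Once the printed commands look right, running the script as
`RUN= sh merge.sh` (merge.sh being whatever you name the file) executes
them for real.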
> - While running rundig on one of my sites, I see that it is spidering
> all the sites. I have set the start_url parameter as follows:
>
> start_url: http://www.example.org/
>
> where example.org is the main site. I want only sites related to ours
> to be spidered, but it is spidering many sites which are not related
> to ours. How can I configure this type of setting?
Is the problem that htdig is hitting sites that are not part of
www.example.com or that it is hitting parts of www.example.com that you
don't want it to hit? If the former, then check your limit_urls_to
attribute. The default setting (${start_url}) would limit the dig to
URLs that include the string "www.example.com". If the problem is the
latter, then it is hard to provide a general answer. You will most
likely need to play with combinations of start_url, limit_urls_to and
exclude_urls to get the effect you are looking for.
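For illustration only, a combination of these attributes might look
something like the sketch below. The /private/ pattern is a made-up
example of a path you might want skipped; the other exclude patterns
are the usual defaults.

```
start_url:      http://www.example.com/
# Only follow URLs containing the start_url string
limit_urls_to:  ${start_url}
# Skip any URL containing one of these substrings
exclude_urls:   /cgi-bin/ .cgi /private/
```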
Jim

