Re: [htdig] i need help on htdig database format

1999-11-25 Thread Gilles Detillieux

According to ronald:
> When htdig exports results from an index in text format, it generates two
> files. The files look like this:
> 
> file1:
> 0 u:http://www.htdig.org/ t:ht://Dig -- Internet search engine software   a:0
> m:936027636 s:373   h:  h:  l:940510479 L:2 I:373   
> d:http://www.htdig.org/ www.htdig.org ht://Dig Search Software (yes, the developers
> use it) ht://Dig Parent Directory   A:

First field: doc ID
u:      URL of doc
t:      doc title
a:      doc state (refer to source)
m:      date/time last modified, sec since 1970-01-01 00:00:00 UTC
s:      doc size in bytes
h:      doc head (excerpt of first max_head_length bytes of doc)
h:      (2nd) meta description contents
        (this 2nd h is a bug - it really should be a unique value
        like D or something)
l:      date/time document was indexed (sec since 1970)
L:      no. of links doc has to other docs
I:      "docImageSize" - has nothing to do with images, but seems to
        contain document size, and may be cumulative in some
        circumstances - can anyone else make any sense of this?
d:      link descriptions - text of links to this doc, ^A separated
A:      anchor names (bookmarks) in doc, ^A separated

All fields are tab (^I) separated.  Sub-fields of d & A use a ^A separator.
The doc head field has all runs of white space (space, tab, newline, etc.)
collapsed to single spaces.
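
To make the layout above concrete, here is a rough Python sketch of a reader for one record of this file. It is not htdig code. It assumes each record sits on a single line, that fields are split on tabs, and that d and A sub-fields are split on ^A (byte 0x01), as described above. The names parse_doc_record and to_utc and the dict keys "head" and "meta_description" are made up for the example, and the sample line is hand-built from doc 0 with illustrative d: sub-fields.

```python
from datetime import datetime, timezone

NUMERIC_KEYS = {"a", "m", "s", "l", "L", "I"}   # fields holding integers
LIST_KEYS = {"d", "A"}                          # fields with ^A-separated sub-fields


def parse_doc_record(line):
    """Split one document record into a dict keyed by field letter."""
    fields = line.rstrip("\n").split("\t")
    record = {"id": int(fields[0])}
    heads = []                      # "h:" occurs twice, so collect in order
    for field in fields[1:]:
        key, sep, value = field.partition(":")
        if not sep:
            continue                # skip empty or malformed fields
        if key == "h":
            heads.append(value)
        elif key in LIST_KEYS:
            record[key] = [part for part in value.split("\x01") if part]
        elif key in NUMERIC_KEYS:
            record[key] = int(value) if value.strip() else None
        else:
            record[key] = value
    # First h: is the doc head, second h: is the meta description.
    record["head"] = heads[0] if heads else ""
    record["meta_description"] = heads[1] if len(heads) > 1 else ""
    return record


def to_utc(seconds):
    """Convert an m: or l: value (seconds since 1970) to a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Hand-made sample modelled on doc 0 above; the d: sub-fields are illustrative.
sample = ("0\tu:http://www.htdig.org/\tt:ht://Dig -- Internet search engine software"
          "\ta:0\tm:936027636\ts:373\th:\th:\tl:940510479\tL:2\tI:373"
          "\td:ht://Dig\x01Parent Directory\tA:")
doc = parse_doc_record(sample)
print(doc["u"], doc["s"], to_utc(doc["m"]).isoformat())
```

Collecting the two h: values in order is only a workaround for the duplicate key; if a later version gives the meta description its own letter, that branch would need to change.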

> file2:

This is db.wordlist...

> 01oct99     i:115   l:0     w:100998    c:2
> 01oct99     i:116   l:0     w:100998    c:2
> 01oct99     i:45    l:6     w:100381    c:2
> 01oct99     i:46    l:0     w:100998    c:2
> 02aug1999   i:48    l:361   w:639       a:2
> 02jun1999   i:50    l:262   w:1382      c:2     a:2
> 02mar1999   i:53    l:378   w:622       a:2
> 02may1999   i:51    l:280   w:1349      c:2     a:2

First field: indexed word (lower case)
i:      doc ID (to match up with records from above)
l:      location of word in doc (0-1000, i.e. tenth of a percent units)
w:      weight of word in searches
c:      no. of occurrences of word in document, if > 1
a:      index into "A:" list above, to indicate which anchor name,
        if any, preceded this word
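
A word list line can be read the same way. The sketch below is only an illustration, not htdig code. It assumes tab-separated fields, and it assumes a missing c: means the word occurred once (c: is only written if > 1) and a missing a: means no anchor preceded the word. The names parse_word_record, location_percent and the dict keys are made up for the example.

```python
# Map the one-letter field tags to readable (hypothetical) key names.
FIELD_NAMES = {"i": "doc_id", "l": "location", "w": "weight",
               "c": "count", "a": "anchor"}


def parse_word_record(line):
    """Split one db.wordlist line into a dict; the first field is the word."""
    fields = line.rstrip("\n").split("\t")
    # Assumed defaults: count 1 when c: is absent, no anchor when a: is absent.
    entry = {"word": fields[0], "count": 1, "anchor": None}
    for field in fields[1:]:
        key, sep, value = field.partition(":")
        if sep and key in FIELD_NAMES and value.strip():
            entry[FIELD_NAMES[key]] = int(value)
    return entry


def location_percent(entry):
    """Convert the 0-1000 location (tenths of a percent) to a percentage."""
    return entry["location"] / 10.0


entry = parse_word_record("02jun1999\ti:50\tl:262\tw:1382\tc:2\ta:2")
print(entry["word"], entry["doc_id"], location_percent(entry))
```

The doc_id value is what you would use to look up the matching document record from the first file. The anchor value is left as a plain integer here, since the numbering base of the index into the A: list is not stated above.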

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You'll receive a message confirming the unsubscription.



[htdig] i need help on htdig database format

1999-11-25 Thread ronald

When htdig exports results from an index in text format, it generates two
files. The files look like this:

file1:
0   u:http://www.htdig.org/ t:ht://Dig -- Internet search engine software
a:0 m:936027636 s:373   h:  h:  l:940510479 L:2 I:373   
d:http://www.htdig.org/
www.htdig.org ht://Dig Search Software (yes, the developers use it)
ht://Dig Parent Directory   A:
1   u:http://www.htdig.org/contents.html    t:ht://Dig Table of Contents    a:0
m:936027636 s:3539  h: Contents General ht://Dig Features and Requirements
Where to get it Installation Configuration FAQ Mailing list Uses of
ht://Dig License information Reference htdig htmerge htnotify htfuzzy
htsearch Configuration file META tags Other How it works Contributors
Release notes ChangeLog TODO Bug Reporting Contributed Work Website stats
Developer Site Quick Search:    h:  l:940510479 L:25    I:3539
d:/contents.html    A:
2   u:http://www.htdig.org/main.html    t:ht://Dig: Overview    a:0
m:940044123
s:3717  h: WWW Search Engine Software ht://Dig Copyright (c) 1995-1999 The
ht://Dig Group Please see the file COPYING for license information. Recent
News * 22 Sep 1999: A new stable release of ht://Dig, htdig-3.1.3, is
released. This release is recommended for all production systems. It solves
most of the outstanding bugs in the 3.1.x releases. See the release notes
or download it. * 1 June 1999: Unfortunately, due to lack of interest from
key developers, the ht://Dig Conference from Aug 19-20 will be cancelled.
We hope h:  l:940510480 L:10    I:3717  d:ht://Dig /main.html   A:
3  and so on.


file2:
01oct99     i:115   l:0     w:100998    c:2
01oct99     i:116   l:0     w:100998    c:2
01oct99     i:45    l:6     w:100381    c:2
01oct99     i:46    l:0     w:100998    c:2
02aug1999   i:48    l:361   w:639       a:2
02jun1999   i:50    l:262   w:1382      c:2     a:2
02mar1999   i:53    l:378   w:622       a:2
02may1999   i:51    l:280   w:1349      c:2     a:2
and so on


Can anyone please tell me exactly what these fields mean?

Ronald





_
Ronald Tournier
Stichting De Digitale Stad
1011 TD Amsterdam
tel. 020 6257493
fax. 020 6382817
tel direkt: 020 5205335
e-mail: [EMAIL PROTECTED]


