Hey Shay.
Some friendly advice. Cross-posting a question will make you unpopular
fast. It's best to start on the most appropriate-seeming list and only
move on from there if you are getting no satisfaction. The question
below looks most at home over on the archive-access list. Let me
have a go at answering it there.
Yours,
St.Ack
Shay Lawless wrote:
Hi all,
I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version
0.5.0-200611082313) to index a collection of ARC files generated by a
web crawl using the Heritrix web crawler (Version 1.4.0).
When I check the metadata tag on the Wera front-end, the following list
of tags is displayed:
ARC Identifier
URL
Time of Archival
Last Modified Time
Mime-Type
File Status
Content Checksum
HTTP Header
When I click on the explain link in the NutchWax front-end, the
following list of tags is displayed:
Segment
Digest
Date
ARCDate
Encoding
Collection
ARCName
ARCOffset
ContentLength
PrimaryType
subType
URL
Title
Boost
Is there a full list of the metadata fields that NutchWax/Nutch
creates when indexing? I'm particularly interested in tags relating to
the actual content on each page, e.g. content type, description, etc.
When searching, does NutchWax/Nutch search across such tags or just
across the parsed text of each page for occurrences of keywords, etc.?
Any help you can provide would be greatly appreciated!
Shay
_______________________________________________
Archive-access-discuss mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/archive-access-discuss