Hi all,
I'm using NutchWax (Version 0.7.0-200611082313) and Wera (Version
0.5.0-200611082313) to index a collection of ARC files generated by a web
crawl using the Heritrix web crawler (Version 1.4.0).
When I check the metadata tag on the Wera front-end, the following list of
tags is displayed:
ARC Identifier
URL
Time of Archival
Last Modified Time
Mime-Type
File Status
Content Checksum
HTTP Header
When I click on the explain link in the NutchWax front-end, the following
list of tags is displayed:
Segment
Digest
Date
ARCDate
Encoding
Collection
ARCName
ARCOffset
ContentLength
PrimaryType
subType
URL
Title
Boost
Is there a full list of the metadata fields that NutchWax/Nutch creates when
indexing? I'm particularly interested in tags relating to the actual content
of each page, e.g. content type, description, etc.
When searching, does NutchWax/Nutch search across such tags, or just across
the parsed text of each page for occurrences of keywords?
Any help you can provide would be greatly appreciated!
Shay
_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers