Hi JB,

We recently finished a complete plugin tutorial on the wiki that explains the functionality of the urlmeta plugin. It can be found here [1]. Could I ask you to have a thorough look at it and at the code? If you still have questions after that, please post them.
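In the meantime, one quick check is whether the seed line splits the way you expect. Here is a minimal sketch in Python, assuming the fields in the seed line quoted below are separated by real tab characters and each metadata field has the form name=value. It is not Nutch's own code; `parse_seed_line` is a hypothetical helper used only for illustration.

```python
def parse_seed_line(line):
    """Split one seed line into (url, metadata).

    Assumption: fields are tab-separated and every field after the
    URL is a name=value pair. Values are kept verbatim, including
    any surrounding quotes.
    """
    parts = line.rstrip("\n").split("\t")
    url = parts[0].strip()
    metadata = {}
    for field in parts[1:]:
        if not field.strip():
            continue  # ignore empty fields caused by doubled tabs
        name, sep, value = field.partition("=")
        if not sep:
            raise ValueError("metadata field without '=': %r" % field)
        metadata[name.strip()] = value.strip()
    return url, metadata


# The seed line from the quoted message, with \t as a real tab.
seed_line = 'http://mySite.com/Guide/index.html\trecommended="Guide"\tkeywords="Guide,Policy,JBmarker"'
url, metadata = parse_seed_line(seed_line)
print(url)
print(metadata)
```

If the file contains a literal backslash followed by "t" instead of a tab, the split yields a single field and the metadata dictionary comes back empty, which is an easy thing to rule out first.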
[1] http://wiki.apache.org/nutch/WritingPluginExample

Thank you

On Wed, Aug 24, 2011 at 9:36 PM, John R. Brinkema <brink...@teo.uscourts.gov> wrote:

> Hi all,
>
> I am trying to use URLmeta to inject meta data into documents that I crawl and
> I am having some problems.
>
> First the context: Nutch 1.3 with Solr 3.2
>
> My seed url file looks like:
> http://mySite.com/Guide/index.html\trecommended="Guide"\tkeywords="Guide,Policy,JBmarker"
>
> I put JBmarker there so I could see where the metadata got put.
>
> Index.html itself is a table of contents of a guide; that is, it is mostly
> a list of outlinks to parts of the overall guide.
>
> My nutch-site.xml includes the following properties:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|urlmeta)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
> <property>
>   <name>urlmeta.tags</name>
>   <value>recommended,keywords</value>
> </property>
>
> I fire up nutch to crawl and all goes well. To see what nutch did, I ran
> 'readseg -dump' and looked at the results. What I found was the following:
>
> ... other Recno's above ...
>
> Recno:: 56
> URL:: http:/mySite.com/Guide/index.html
>
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Aug 23 10:08:18 EDT 2011
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 1.0
> Signature: 5c182af41027766eccf1ea60d112772c
> Metadata:
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Aug 23 10:08:04 EDT 2011
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: recommended: Guide_ngt_: 1314108489210keywords: "Guide,Policy,JBmarker"
>
> Content::
> Version: -1
> url: http://mySite.com/Guide/index.html
> base: http://mySite.com/Guide/index.html
> ... lots more content ...
>
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Aug 23 10:08:15 EDT 2011
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: recommended: Guide_ngt_: 1314108489210keywords: "Guide,Policy,JBmarker"_pst_: success(1), lastModified=0
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Guide
> Outlinks: 60
> outlink: toUrl: http://mySite.com/Home/About.html anchor: About Me
> outlink: toUrl: http://mySite.com/Guide/Contact_The_Guide.html anchor: Contact Me
> ... many more outlinks ...
> Content Metadata: nutch.content.digest=5c182af41027766eccf1ea60d112772c
> Accept-ranges=bytes Date=Tue, 23 Aug 2011 16:28:43 GMT Content-Length=28798
> Last-Modified=Wed, 06 Apr 2011 00:15:10 GMT nutch.crawl.score=1.0 _fst_=33
> nutch.segment.name=20110823100811 Content-Type=text/html
> Connection=close Server=Netscape-Enterprise/6.0
> Parse Metadata: CharEncodingForConversion=windows-1252
> OriginalCharEncoding=windows-1252
>
> ParseText::
> ... lots of parsed text ...
>
> Recno:: 57
>
> ... and so forth.
>
> JBmarker does not appear anywhere else, in this segment or any of the
> others.
>
> When I do a solrindex, JBmarker does not appear to be anywhere. ??
>
> *What I expected*
>
> As I understand URLmeta (as defined by the two nutch patches), the meta
> data that is included with the url is injected into the seed url; that is
> to say, it is as if the lines:
>
> <META NAME="recommended" CONTENT="Guide">
> <META NAME="keywords" CONTENT="Guide,Policy,JBmarker">
>
> were in the seed url content. Furthermore, it is as if those two lines
> were in all the outlink content of the seed url. So, I expected that when I
> looked at all the CrawlDatum and ParseData of the outlinks from the seed
> url, I would see the same meta data as in the seed CrawlDatum and ParseData.
> Which is clearly not the case.
>
> As for solrindex, I assume that I have some work to do to get any special
> metadata actions moved over to solr; a special plugin of some sort. That
> is, urlmeta does not help get the collected metadata from Nutch to Solr.
>
> So what is happening? Where did I go astray? Am I analyzing the Nutch
> dumps incorrectly?
>
> One other side note: I assume that Luke no longer will help me debug Nutch
> since it works with Lucene indexes and Nutch no longer creates such beasts.
> Are there any tools that help with viewing Nutch databases? It seems that
> Nutch takes some liberties with the data it is dumping (e.g., the meta tags
> all concatenated together without delimiters; I assume that internally, the
> meta tags are separated somehow).
>
> Thanks, as always.

--
*Lewis*
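
P.S. On the side note about dumped metadata running together: when the key names are already known, the run can be split by hand. The following is a rough Python sketch for reading dumps, not a description of how Nutch stores or writes metadata. `split_dump_metadata` is a hypothetical helper, and the key list is an assumption taken from the dump quoted above.

```python
import re


def split_dump_metadata(text, keys):
    """Split a delimiter-less 'key: value' run using known key names.

    Heuristic only: it will mis-split if a value itself contains one
    of the keys followed by ': '.
    """
    # Try longer keys first so a key that is a prefix of another does not win.
    ordered = sorted(keys, key=len, reverse=True)
    pattern = re.compile("(%s): " % "|".join(re.escape(k) for k in ordered))
    matches = list(pattern.finditer(text))
    fields = {}
    for i, match in enumerate(matches):
        # A value runs until the next known key, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        fields[match.group(1)] = text[match.end():end].strip()
    return fields


# Metadata text as it appears in the readseg dump quoted above.
dump_line = 'recommended: Guide_ngt_: 1314108489210keywords: "Guide,Policy,JBmarker"'
fields = split_dump_metadata(dump_line, ["recommended", "_ngt_", "keywords"])
for name, value in fields.items():
    print(name, "=>", value)
```

Keys that are not in the supplied list stay attached to the preceding value, so the list needs to cover every key that appears in the line being read.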