Hi JB,

We have recently finished a complete plugin tutorial on the wiki which fully
explains the functionality of the urlmeta plugin. It can be found here [1].
Could I ask you to have a thorough look at it and at the code, and if you
still have questions, please post them again.

[1] http://wiki.apache.org/nutch/WritingPluginExample

Thank you

On Wed, Aug 24, 2011 at 9:36 PM, John R. Brinkema <brink...@teo.uscourts.gov> wrote:

> Hi all,
>
> I am trying to use URLmeta to inject meta data into documents that I crawl and
> I am having some problems.
>
> First the context:  Nutch 1.3 with Solr 3.2
>
> My seed URL file looks like:
>
> http://mySite.com/Guide/index.html\trecommended="Guide"\tkeywords="Guide,Policy,JBmarker"
>
> I put JBmarker there so I could see where the metadata got put.
>
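On the seed line above: my understanding is that the metadata has to be separated from the URL by real tab characters, with each field in name=value form. Below is a minimal sketch of that layout in plain Java. It is not the Nutch injector code, and the class and method names are made up for illustration. It assumes fields are split on tabs and then on the first '=', and that fields without '=' are skipped.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: illustrates the
// "url<TAB>name=value<TAB>name=value" seed-line layout.
class SeedLineSketch {

    // The URL is everything before the first tab character.
    static String url(String seedLine) {
        return seedLine.split("\t")[0];
    }

    // Returns the name=value pairs that follow the URL on one seed line.
    static Map<String, String> parseMetadata(String seedLine) {
        Map<String, String> metadata = new LinkedHashMap<>();
        String[] fields = seedLine.split("\t");
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq == -1) {
                continue; // assumption: a field without '=' is ignored
            }
            // Split on the first '=' only; the value is kept verbatim.
            metadata.put(fields[i].substring(0, eq), fields[i].substring(eq + 1));
        }
        return metadata;
    }

    public static void main(String[] args) {
        String line = "http://mySite.com/Guide/index.html"
                + "\trecommended=\"Guide\""
                + "\tkeywords=\"Guide,Policy,JBmarker\"";
        System.out.println(url(line));
        parseMetadata(line).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```

Two things fall out of this sketch: any double quotes in the seed file stay part of the value, and a literal backslash followed by "t" typed into the file is not a tab, so the whole line would be treated as the URL.
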
> Index.html itself is a table of contents of a guide; that is, it is mostly
> a list of outlinks to parts of the overall guide.
>
> My nutch-site.xml includes the following properties:
>
> <property>
> <name>plugin.includes</name>
> <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|urlmeta)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> </property>
> <property>
> <name>urlmeta.tags</name>
> <value>recommended,keywords</value>
> </property>
>
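One thing worth checking in the plugin.includes value above: as far as I know, it is treated as a regular expression that has to match a plugin's whole id. The sketch below is plain Java, not Nutch code. The class name and the candidate ids are made up for illustration, and whole-id matching is my assumption. It shows which ids that value would accept.

```java
import java.util.regex.Pattern;

// Hypothetical check, not part of Nutch: tests candidate plugin ids
// against a plugin.includes value, assuming whole-id regex matching.
class PluginIncludesSketch {

    static boolean isIncluded(String pluginIncludes, String pluginId) {
        // matches() requires the entire id to match, not just a substring.
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String includes = "protocol-http|urlfilter-regex|parse-(html|tika)"
                + "|index-(basic|anchor|urlmeta)|scoring-opic"
                + "|urlnormalizer-(pass|regex|basic)";
        // Candidate ids are illustrative; check plugin.xml for the real one.
        for (String id : new String[] {"index-basic", "index-urlmeta", "urlmeta"}) {
            System.out.println(id + " included: " + isIncluded(includes, id));
        }
    }
}
```

If the plugin in your install is registered as urlmeta rather than index-urlmeta (its plugin.xml would tell you), then index-(basic|anchor|urlmeta) would not pick it up, and urlmeta would need its own alternative in the value.
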
> I fire up nutch to crawl and all goes well.   To see what nutch did, I ran
> 'readseg -dump' and looked at the results.  What I found was the following:
>
> ... other Recno's above ...
>
> Recno:: 56
> URL:: http:/mySite.com/Guide/index.html
>
> CrawlDatum::
> Version: 7
> Status: 65 (signature)
> Fetch time: Tue Aug 23 10:08:18 EDT 2011
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 0 seconds (0 days)
> Score: 1.0
> Signature: 5c182af41027766eccf1ea60d112772c
> Metadata:
>
> CrawlDatum::
> Version: 7
> Status: 1 (db_unfetched)
> Fetch time: Tue Aug 23 10:08:04 EDT 2011
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: recommended: Guide_ngt_: 1314108489210keywords:
> "Guide,Policy,JBmarker"
>
> Content::
> Version: -1
> url: http://mySite.com/Guide/index.html
> base: http://mySite.com/Guide/index.html
> ... lots more content ...
>
> CrawlDatum::
> Version: 7
> Status: 33 (fetch_success)
> Fetch time: Tue Aug 23 10:08:15 EDT 2011
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: null
> Metadata: recommended: Guide_ngt_: 1314108489210keywords:
> "Guide,Policy,JBmarker"_pst_: success(1), lastModified=0
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Guide
> Outlinks: 60
>  outlink: toUrl: http://mySite.com/Home/About.html anchor: About Me
>  outlink: toUrl: http://mySite.com/Guide/Contact_The_Guide.html anchor: Contact Me
> ... many more outlinks ...
> Content Metadata: nutch.content.digest=5c182af41027766eccf1ea60d112772c
> Accept-ranges=bytes Date=Tue, 23 Aug 2011 16:28:43 GMT Content-Length=28798
> Last-Modified=Wed, 06 Apr 2011 00:15:10 GMT nutch.crawl.score=1.0 _fst_=33
> nutch.segment.name=20110823100811 Content-Type=text/html
> Connection=close Server=Netscape-Enterprise/6.0
> Parse Metadata: CharEncodingForConversion=windows-1252
> OriginalCharEncoding=windows-1252
>
> ParseText::
> ... lots of parsed text ...
>
> Recno::  57
>
> ... and so forth.
>
> JBmarker does not appear anywhere else, in this segment or any of the
> others.
>
> When I do a solrindex, JBmarker does not appear to be anywhere.  ??
>
> *What I expected*
>
> As I understand URLmeta (as defined by the two nutch patches), the meta
> data that is included with the url  is injected into the seed url; that is
> to say, it is as if the lines:
>
> <META NAME="recommended" CONTENT="Guide">
> <META NAME="keywords" CONTENT="Guide,Policy,JBmarker">
>
> were in the seed url content.  Furthermore,  it is as if those two lines
> were in all the outlink content of the seed url.  So, I expected that when I
> looked at all the CrawlDatum and ParseData of the outlinks from the seed
> url, I would see the same meta data as in the seed CrawlDatum and ParseData.
>  Which is clearly not the case.
>
> As for solrindex, I assume that I have some work to do to get any special
> metadata actions moved over to solr; a special plugin of some sort.  That
> is, urlmeta does not help get the collected metadata from Nutch to Solr.
>
> So what is happening?  Where did I go astray?  Am I analyzing the Nutch
> dumps incorrectly?
>
> One other side note:  I assume that Luke no longer will help me debug Nutch
> since it works with Lucene indexes and Nutch no longer creates such beasts.
>  Are there any tools that help with viewing Nutch databases?  It seems that
> Nutch takes some liberties with the data it is dumping (e.g., the meta tags
> all concatenated together without delimiters; I assume that internally, the
> meta tags are separated somehow).
>
> Thanks, as always.


-- 
*Lewis*
