I've been crawling the user groups, and I feel like Nutch can do this by default, but I just can't seem to crack it.
I want to grab meta tags from indexed pages and insert them into the database. Specifically, I'll have some meta tags that identify the type of content on the page, so that I can group results as video, photo, news, etc. I looked into 655 and 855, but I believe those are for adding metadata, not for using metadata already in the page.

What I expect is that when I do a dump, the fields would be visible in Metadata:

http://test.site.com/index.html Version: 7
Status: 2 (db_fetched)
Fetch time: Wed Mar 02 20:22:33 UTC 2011
Modified time: Thu Jan 01 00:00:00 UTC 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.013783041
Signature: 26df10bef4cf4cebe3f1041ba121068d
Metadata: _pst_: success(1), lastModified=0, MYFIELD=MYVALUE

I think NUTCH-779 may be what I need, and as I'm running version 1.2, I should have this capability. I've set db.parsemeta.to.crawldb, but is there something else I need to do? Or is it being populated, and dumping the database just doesn't show me those values?
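
For reference, here is a minimal sketch of how the db.parsemeta.to.crawldb property would typically be overridden in conf/nutch-site.xml. MYFIELD is the placeholder key from the dump above, and this assumes the value has to match the key name under which the tag ends up in the parse metadata. It is an illustration of the setting's shape, not a confirmed working configuration.

```xml
<!-- conf/nutch-site.xml (sketch; MYFIELD is a placeholder key, not a real field name) -->
<configuration>
  <property>
    <name>db.parsemeta.to.crawldb</name>
    <!-- Assumed: comma-separated list of parse metadata keys to copy into the crawldb -->
    <value>MYFIELD</value>
  </property>
</configuration>
```

If the key never reaches the parse metadata in the first place, there would be nothing for this setting to copy, which could explain why the dump shows no extra fields.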