Re: Need help

abhishek tiwari Tue, 29 May 2012 06:07:42 -0700

Thanks for replying. I am still not able to fetch the keywords.
My nutch-site.xml is:

<configuration>
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
<property>
<name>metatags.names</name>
<value>description;keywords</value>
<description> Names of the metatags to extract, separated by ';'.
  Use '*' to extract all metatags. Prefixes the names with 'metatag.'
  in the parse-metadata. For instance to index description and keywords,
  you need to activate the plugin index-metadata and set the value of the
  parameter 'index.parse.md' to 'metatag.description;metatag.keywords'.
</description>
</property>
<property>
  <name>index.parse.md</name>
  <value>metatag.description,metatag.keywords</value>
  <description>
  Comma-separated list of keys to be taken from the parse metadata to
  generate fields.
  Can be used e.g. for 'description' or 'keywords' provided that these
  values are generated by a parser (see parse-metatags plugin)
  </description>
</property>


</configuration>

and the Solr schema has the following fields:

<fields>
  <field name="id" type="string" stored="true" indexed="true"/>
  <!-- core fields -->
  <field name="segment" type="string" stored="true" indexed="false"/>
  <field name="digest" type="string" stored="true" indexed="false"/>
  <field name="boost" type="float" stored="true" indexed="false"/>
  <!-- fields for index-basic plugin -->
  <field name="host" type="url" stored="false" indexed="true"/>
  <field name="site" type="string" stored="false" indexed="true"/>
  <field name="url" type="url" stored="true" indexed="true" required="true"/>
  <field name="content" type="text" stored="true" indexed="true"/>
  <field name="title" type="text" stored="true" indexed="true"/>
  <field name="cache" type="string" stored="true" indexed="false"/>
  <field name="tstamp" type="date" stored="true" indexed="false"/>
  <!-- fields for index-anchor plugin -->
  <field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
  <!-- fields for index-more plugin -->
  <field name="type" type="string" stored="true" indexed="true" multiValued="true"/>
  <field name="contentLength" type="long" stored="true" indexed="false"/>
  <field name="lastModified" type="date" stored="true" indexed="false"/>
  <field name="date" type="date" stored="true" indexed="true"/>
  <!-- fields for languageidentifier plugin -->
  <field name="lang" type="string" stored="true" indexed="true"/>
  <!-- fields for subcollection plugin -->
  <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>
  <!-- fields for feed plugin (tag is also used by microformats-reltag)-->
  <field name="author" type="string" stored="true" indexed="true"/>
  <field name="tag" type="string" stored="true" indexed="true" multiValued="true"/>
  <field name="feed" type="string" stored="true" indexed="true"/>
  <field name="publishedDate" type="date" stored="true" indexed="true"/>
  <field name="updatedDate" type="date" stored="true" indexed="true"/>
  <!-- fields for creativecommons plugin -->
  <field name="cc" type="string" stored="true" indexed="true" multiValued="true"/>
  <!-- fields for the metatags plugin -->
  <field name="metatag.description" type="text" stored="true" indexed="true"/>
  <field name="metatag.keywords" type="text" stored="true" indexed="true"/>
</fields>
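
One way to rule out a simple naming mismatch between nutch-site.xml and the Solr schema is a small consistency check. The following is only a minimal Python sketch, not part of Nutch or Solr: the function names are hypothetical, and it assumes the separators and the 'metatag.' prefix stated in the property descriptions above (';' for metatags.names, ',' for index.parse.md). It checks names only and says nothing about whether the plugins actually run during the crawl.

```python
import xml.etree.ElementTree as ET

# Trimmed-down copies of the two files from this message (sample data only).
NUTCH_SITE = """
<configuration>
<property><name>metatags.names</name><value>description;keywords</value></property>
<property><name>index.parse.md</name><value>metatag.description,metatag.keywords</value></property>
</configuration>
"""

SCHEMA_FIELDS = """
<fields>
<field name="metatag.description" type="text" stored="true" indexed="true"/>
<field name="metatag.keywords" type="text" stored="true" indexed="true"/>
</fields>
"""


def read_properties(nutch_site_xml):
    """Return {name: value} for every <property> element in the config text."""
    root = ET.fromstring(nutch_site_xml)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name", default="").strip()
        props[name] = prop.findtext("value", default="").strip()
    return props


def read_schema_fields(schema_xml):
    """Return the set of field names declared in the schema text."""
    root = ET.fromstring(schema_xml)
    return {field.get("name") for field in root.iter("field")}


def check_metatag_config(nutch_site_xml, schema_xml):
    """Return a list of human-readable naming problems (empty if none)."""
    props = read_properties(nutch_site_xml)
    # Assumption from the description text: metatags.names is ';'-separated.
    names = [n.strip() for n in props.get("metatags.names", "").split(";") if n.strip()]
    # Assumption from the description text: index.parse.md is ','-separated.
    keys = [k.strip() for k in props.get("index.parse.md", "").split(",") if k.strip()]
    fields = read_schema_fields(schema_xml)

    problems = []
    for name in names:
        # Extracted metatags are stored under the 'metatag.' prefix.
        if name != "*" and "metatag." + name not in keys:
            problems.append("metatag." + name + " is extracted but not listed in index.parse.md")
    for key in keys:
        if key not in fields:
            problems.append(key + " is listed in index.parse.md but missing from the Solr schema")
    return problems


if __name__ == "__main__":
    found = check_metatag_config(NUTCH_SITE, SCHEMA_FIELDS)
    print(found if found else "no naming mismatches found")
```

If this reports nothing for the real files, the names line up and the cause is more likely elsewhere, for example in which plugins are actually loaded.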


I am not able to find the problem.

I have also created my own plugin, but the field is still not populated when we crawl.

Please help me find the reason.
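
To see whether the metatag fields reach the index at all, one option is to query Solr directly and look at the stored values. The snippet below is a hedged sketch that only builds such a query URL. It assumes the Solr base URL used in the crawl command quoted below (http://localhost:8080/solr) and the standard /select handler with the q, fl, rows and wt parameters; build_check_url is a hypothetical helper name.

```python
from urllib.parse import urlencode


def build_check_url(solr_base, fields, rows=5):
    """Build a /select URL that returns 'url' plus the given fields for a few documents."""
    params = {
        "q": "*:*",                               # match every indexed document
        "fl": ",".join(["url"] + list(fields)),   # stored fields to return
        "rows": rows,                             # keep the response small
        "wt": "json",                             # JSON response format
    }
    return solr_base.rstrip("/") + "/select?" + urlencode(params)


if __name__ == "__main__":
    # Open the printed URL in a browser or fetch it with any HTTP client.
    print(build_check_url("http://localhost:8080/solr",
                          ["metatag.description", "metatag.keywords"]))
```

If the returned documents carry url but neither metatag field, the values are being lost before indexing rather than in the schema.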

On Wed, May 23, 2012 at 3:47 PM, Julien Nioche <
[email protected]> wrote:

> The urlmeta plugin is not what you are after. See the instructions at
> http://wiki.apache.org/nutch/IndexMetatags
>
> On 23 May 2012 10:30, abhishek tiwari <[email protected]> wrote:
>
> > Hi, I am new to Nutch.
> >
> > I want to use the urlmeta plugin, but I am not able to fetch meta tags.
> >
> > 1) Added the following in nutch-site.xml:
> >
> >  <property>
> >  <name>plugin.includes</name>
> >  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)|urlmeta</value>
> >  <description>Regular expression naming plugin directory names to
> >  include.  Any plugin not matching this expression is excluded.
> >  In any case you need at least include the nutch-extensionpoints plugin.
> >  By default Nutch includes crawling just HTML and plain text via HTTP,
> >  and basic indexing and search plugins.
> >  </description>
> > </property>
> > <property>
> >  <name>urlmeta.tags</name>
> >  <value></value>
> >  <description>
> >
> >  </description>
> > </property>
> >
> >
> > 2) Added <field name="keywords" type="string" stored="true"
> > indexed="true"/> to the Solr schema.xml
> >
> > 3) Ran bin/nutch crawl urls -solr http://localhost:8080/solr -depth 3
> > -topN 5
> >
> > The URL list and the other setup steps are also done,
> > but the keywords field is not getting populated.
> >
> > Please suggest what I am missing.
> >
>
>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
