Hi,

Have a look at the ScoringFilter interface, especially the plugin scoring-depth.
In the method distributeScoreToOutlinks the fromUrl is at hand, so it is easy
to add it to the CrawlDatum metadata of every outlink.

In the method updateDbScore the value must then be copied from the inlinked
CrawlDatums into the CrawlDb's CrawlDatum "datum", otherwise it is not stored
in the CrawlDb.

Just modify an existing plugin or implement your own.

To finally index the parent URL, add the metadata key which holds the
parent/from URL to the property index.db.md:

<property>
  <name>index.db.md</name>
  <value></value>
  <description>
     Comma-separated list of keys to be taken from the crawldb metadata to generate fields.
     Can be used to index values propagated from the seeds with the plugin urlmeta
  </description>
</property>
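
As a rough sketch, an override in nutch-site.xml could look like the
following. The key name "parentUrl" is only an assumption for illustration;
use whatever key your scoring filter actually writes into the CrawlDatum
metadata. As far as I know the property is read by the index-metadata plugin,
so that plugin must also be enabled in plugin.includes.

```xml
<!-- nutch-site.xml: hypothetical override, "parentUrl" is an assumed key name -->
<property>
  <name>index.db.md</name>
  <!-- must match the metadata key set by your scoring filter -->
  <value>parentUrl</value>
</property>
```

The value should then show up in the index under a field named after the key,
so the Solr or Elasticsearch schema needs a matching field.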

Cheers,
Sebastian

On 11/27/2016 12:49 PM, ashokraj.lourdus...@cognizant.com wrote:
> Hi,
> 
> 
> While nutch1.x is indexing in solr (or Elasticsearch) I need to include the 
> immediate parent URL too.
> 
> There is no clear help online on where to do this.
> 
> I don't need the hierarchy till seed url, but just the immediate parent of 
> current parsing document.
> 
> 
> Someone suggested to do it on outlinks, but in code, can anyone help me where 
> to find this and include it.
> 
> I have Nutch setup in my eclipse.
> 
> 
> 
> Thanks in advance,
> 
> -Ashok.
> 
> 
