[ 
https://issues.apache.org/jira/browse/NUTCH-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13119281#comment-13119281
 ] 

Markus Jelsma commented on NUTCH-1143:
--------------------------------------

It seems the anchor field was once used to index the best-ranking anchor 
for a given URL, but that indexing code is legacy. In the current version, users 
must invert links, pass the linkdb to the indexer, and enable index-anchor to 
index anchors, so storing an anchor in LinkDatum is obsolete for now. 

Instead of removing the anchor code completely, we should make it optional. 
That way we can write indexing code later and pass the webgraph to the 
indexer instead of a linkdb.

I opt for defaulting the setting to false (i.e. do not store anchors) since 
they are unusable at the moment.
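
To illustrate the idea, here is a minimal sketch of a link record that only 
keeps its anchor when a flag is set. It is not the actual LinkDatum code: the 
class name LinkRecordSketch, the storeAnchors flag and the serialized layout 
are all hypothetical, and it uses plain java.io streams rather than Hadoop's 
Writable interface so that it stands on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a webgraph link record; NOT Nutch's LinkDatum.
final class LinkRecordSketch {
    final String url;
    final String anchor; // null when anchors are omitted

    // storeAnchors is an assumed option; false would be the proposed default.
    LinkRecordSketch(String url, String anchor, boolean storeAnchors) {
        this.url = url;
        this.anchor = storeAnchors ? anchor : null;
    }

    // Writes the URL, a presence flag, and the anchor only if it was kept.
    byte[] serialize() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(url);
            out.writeBoolean(anchor != null);
            if (anchor != null) {
                out.writeUTF(anchor);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads a record back; the anchor is null if none was serialized.
    static LinkRecordSketch deserialize(byte[] data) {
        try (DataInputStream in =
                 new DataInputStream(new ByteArrayInputStream(data))) {
            String url = in.readUTF();
            String anchor = in.readBoolean() ? in.readUTF() : null;
            return new LinkRecordSketch(url, anchor, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With the flag off, each record carries only the URL and a one-byte presence 
marker, which is where the space and I/O savings would come from on large 
webgraphs.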
                
> Omit anchor in webgraph's LinkDatum
> -----------------------------------
>
>                 Key: NUTCH-1143
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1143
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.5
>
>
> Anchors are stored unchecked in the webgraph, apparently for cosmetic 
> reasons only. With hundreds of millions of records, they take up 
> significant space and I/O time.
> This issue should add an option to omit the anchor.


        
