[
https://issues.apache.org/jira/browse/NUTCH-488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Marcin Okraszewski updated NUTCH-488:
-
Attachment: ignore_tags_v3.patch
OK, yet another approach based on Doğacan's comments. Sorry for the delay, but I
didn't notice the comment earlier.
- I didn't notice the conf.getStrings() method. Thanks for the hint :)
- I did keep backward compatibility with the use_action param, but it
works a bit differently now when no value is set. The default is now that forms
are used, but when use_action is not specified they can be dropped with the
ignore_tags setting. If someone has explicitly set use_action to true, it cannot
be overridden by ignore_tags. It is still a bit inconsistent, but it is
understandable that the more specific setting (use_action) takes precedence. If
the default were false, then with use_action undefined and form not added to
ignore_tags, one could expect forms to be taken, but they wouldn't be.
Keeping backward compatibility makes the code a bit clumsy :( ... and I
think I've made it overly flexible, but that was the cleanest solution here.
- For the repeated if: I agree it is error-prone, but on the other hand it is
easy to understand. I didn't quite understand Doğacan's proposal :( but I think
I did something acceptable: simply remove all specified tags from the link params.
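
To make the precedence between use_action and ignore_tags easier to follow, here is a minimal sketch of the rules described above. It is not the code from ignore_tags_v3.patch: the class and method names are hypothetical, the tag-to-attribute defaults are assumed for illustration, and the settings are passed in directly rather than read from the configuration with conf.getStrings().

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class LinkParamsSketch {

    /**
     * Builds the tag -> URL-attribute map used for outlink extraction.
     *
     * @param useAction  TRUE/FALSE if use_action was set explicitly, null if unset
     * @param ignoreTags tag names listed in ignore_tags (may be empty)
     */
    static Map<String, String> buildLinkParams(Boolean useAction, List<String> ignoreTags) {
        // Assumed defaults for illustration; forms are included by default.
        Map<String, String> linkParams = new LinkedHashMap<>();
        linkParams.put("a", "href");
        linkParams.put("area", "href");
        linkParams.put("form", "action");
        linkParams.put("frame", "src");
        linkParams.put("iframe", "src");
        linkParams.put("script", "src");
        linkParams.put("link", "href");
        linkParams.put("img", "src");

        // Simply remove every tag named in ignore_tags.
        for (String tag : ignoreTags) {
            linkParams.remove(tag.trim().toLowerCase(Locale.ROOT));
        }

        // An explicit use_action is the more specific setting and wins.
        if (useAction != null) {
            if (useAction) {
                linkParams.put("form", "action");
            } else {
                linkParams.remove("form");
            }
        }
        return linkParams;
    }

    public static void main(String[] args) {
        // use_action unset: ignore_tags can drop forms.
        System.out.println(buildLinkParams(null, Arrays.asList("form")).containsKey("form")); // false
        // use_action explicitly true: ignore_tags cannot override it.
        System.out.println(buildLinkParams(true, Arrays.asList("form", "img")).containsKey("form")); // true
        // Nothing configured: forms are used by default.
        System.out.println(buildLinkParams(null, Collections.<String>emptyList()).containsKey("form")); // true
    }
}
```

Passing null for an unset use_action is what lets the sketch tell "not configured" apart from an explicit false, which is the distinction the backward-compatible behaviour depends on.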
Avoid parsing unnecessary links and get a more relevant outlink list
Key: NUTCH-488
URL: https://issues.apache.org/jira/browse/NUTCH-488
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Environment: Windows, Java 1.5
Reporter: Emmanuel Joke
Attachments: DOMContentUtils.patch, ignore_tags_v2.patch,
ignore_tags_v3.patch, nutch-default.xml.patch
The NekoHTML parser uses a method to extract all outlinks from the HTML page. It
extracts them from the HTML content based on the list of params defined
in the setConf() method. This list of links is then truncated to the maximum
number of outlinks processed per page, defined in nutch-default.xml
(db.max.outlinks.per.page = 100 by default), and finally it is passed
through all defined urlfilters.
Unfortunately, the list of outlinks can contain more than 100 entries, in which
case the truncation could remove some relevant links.
So I've added a few options to nutch-default.xml to enable/disable
the extraction of links from specific HTML tags in this parser (SCRIPT, IMG, FORM,
LINK).
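
The truncation problem can be illustrated with a small self-contained sketch. It is only a model of the behaviour described above, not Nutch code: the Link class, the helper names and the sample page are all hypothetical, and the limit of 100 simply mirrors the db.max.outlinks.per.page default.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class OutlinkTruncationSketch {

    /** Hypothetical stand-in for an extracted outlink: source tag plus URL. */
    static final class Link {
        final String tag;
        final String url;

        Link(String tag, String url) {
            this.tag = tag;
            this.url = url;
        }
    }

    /** Keeps at most maxOutlinks links, skipping links that come from ignored tags. */
    static List<Link> selectOutlinks(List<Link> extracted, Set<String> ignoreTags, int maxOutlinks) {
        List<Link> kept = new ArrayList<>();
        for (Link link : extracted) {
            if (kept.size() >= maxOutlinks) {
                break; // truncation: everything after this point is lost
            }
            if (ignoreTags.contains(link.tag.toLowerCase(Locale.ROOT))) {
                continue; // ignored tags never use up the outlink budget
            }
            kept.add(link);
        }
        return kept;
    }

    /** Sample page: 100 image links followed by 5 anchor links. */
    static List<Link> samplePage() {
        List<Link> links = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            links.add(new Link("img", "http://example.com/img" + i + ".png"));
        }
        for (int i = 0; i < 5; i++) {
            links.add(new Link("a", "http://example.com/page" + i + ".html"));
        }
        return links;
    }

    static int countTag(List<Link> links, String tag) {
        int count = 0;
        for (Link link : links) {
            if (link.tag.equals(tag)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Link> page = samplePage();
        // Nothing ignored: the 100 image links fill the limit, no anchors survive.
        System.out.println(countTag(selectOutlinks(page, Collections.<String>emptySet(), 100), "a")); // prints 0
        // IMG ignored: all 5 anchor links are kept.
        Set<String> ignore = new HashSet<>(Arrays.asList("img"));
        System.out.println(countTag(selectOutlinks(page, ignore, 100), "a")); // prints 5
    }
}
```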