[
https://issues.apache.org/jira/browse/NUTCH-762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848095#action_12848095
]
Julien Nioche commented on NUTCH-762:
-------------------------------------
{quote}
I just noticed that the new Generator uses different config property names
("generator." vs. "generate."), and the older versions are now marked with
"(Deprecated)". However, this doesn't reflect the reality - properties with old
names are simply ignored now, whereas "deprecated" implies that they should
still work.
{quote}
They will still work if we keep the old Generator as OldGenerator, which is
what we assume in the patch. If we decide to get rid of the OldGenerator then
yes, they should not be marked with "(Deprecated)".
{quote}
For back-compat reason I think they should still work - the current (admittedly
awkward) prefix is good enough, and I think that changing it in a minor release
would create confusion. I suggest reverting to the old names where appropriate,
and add new properties with the same prefix, i.e. "generate.".
{quote}
The original assumption was that we'd keep both this version of the generator
and the old one, in which case we could have used a different prefix for the
properties. If we want to *replace* the old generator altogether - which I
think would be a good option - then indeed we should discuss whether or not to
align on the old prefix.
I don't have strong feelings on whether or not to modify the prefix in a minor
release.
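
To illustrate the back-compat option discussed above, here is a minimal sketch of a lookup that prefers a new property name and falls back to the old one. It is not taken from the patch: it uses plain java.util.Properties in place of Nutch's Hadoop-based configuration, and the class name, method name and the two property keys are hypothetical, chosen only to show the "generator." vs. "generate." prefixes.

```java
import java.util.Properties;

public class GeneratorProps {
    // Hypothetical keys for illustration only; the names used in the
    // patch may differ.
    static final String NEW_KEY = "generator.max.count";
    static final String OLD_KEY = "generate.max.count";

    /** Returns the value under newKey, else under oldKey, else def. */
    static int getIntWithFallback(Properties conf, String newKey,
                                  String oldKey, int def) {
        String value = conf.getProperty(newKey);
        if (value == null) {
            // Honour the old name so existing configurations keep working.
            value = conf.getProperty(oldKey);
        }
        return value == null ? def : Integer.parseInt(value.trim());
    }
}
```

With a lookup of this kind the old names would genuinely behave as deprecated rather than being silently ignored.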
> Alternative Generator which can generate several segments in one parse of the
> crawlDB
> -------------------------------------------------------------------------------------
>
> Key: NUTCH-762
> URL: https://issues.apache.org/jira/browse/NUTCH-762
> Project: Nutch
> Issue Type: New Feature
> Components: generator
> Affects Versions: 1.0.0
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Fix For: 1.1
>
> Attachments: NUTCH-762-v2.patch, NUTCH-762-v3.patch
>
>
> When using Nutch on a large scale (e.g. billions of URLs), the operations
> related to the crawlDB (generate - update) tend to take the biggest part of
> the time. One solution is to limit such operations to a minimum by generating
> several fetchlists in one parse of the crawlDB, then updating the crawlDB only
> once for several segments. The existing Generator allows several successive
> runs by generating a copy of the crawlDB and marking the URLs to be fetched. In
> practice this approach does not work well as we need to read the whole
> crawlDB as many times as we generate a segment.
> The patch attached contains an implementation of a MultiGenerator which can
> generate several fetchlists by reading the crawlDB only once. The
> MultiGenerator differs from the Generator in other aspects:
> * can filter the URLs by score
> * normalisation is optional
> * IP resolution is done ONLY on the entries which have been selected for
> fetching (during the partitioning). Running the IP resolution on the whole
> crawlDb is too slow to be usable on a large scale
> * can cap the number of URLs per host or domain (but not by IP)
> * can choose to partition by host, domain or IP
> Typically the same unit (e.g. domain) would be used for capping the URLs and
> for partitioning; however, as we can't count the max number of URLs by IP,
> another unit must be chosen for the cap when partitioning by IP.
> We found that using a filter on the score can dramatically improve the
> performance as this reduces the amount of data being sent to the reducers.
> The MultiGenerator is called via:
>   nutch org.apache.nutch.crawl.MultiGenerator ...
> with the following options:
>   MultiGenerator <crawldb> <segments_dir> [-force] [-topN N]
>     [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm]
>     [-maxNumSegments num]
> where most parameters are similar to those of the default Generator, apart
> from:
> -noNorm (explicit)
> -topN : max number of URLs per segment
> -maxNumSegments : the actual number of segments generated could be less than
> the max value selected, e.g. when not enough URLs are available for fetching
> and they fit in fewer segments
> Please give it a try and let me know what you think of it
> Julien Nioche
> http://www.digitalpebble.com
>
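
The selection behaviour described in the issue (score filter, cap per host, several segments from one read of the crawlDB) can be sketched roughly as follows. This is an in-memory illustration under stated assumptions, not the MapReduce code from the patch: the class, the Entry type and the generate method are hypothetical, the input is assumed to be already sorted by descending score, and the cap is applied per host within each segment.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MultiSegmentSketch {
    /** A crawlDB entry reduced to the fields this sketch needs. */
    static final class Entry {
        final String url;
        final String host;
        final float score;
        Entry(String url, String host, float score) {
            this.url = url;
            this.host = host;
            this.score = score;
        }
    }

    /**
     * Distributes entries over at most maxNumSegments segments in a single
     * pass over the input. Entries scoring below minScore are dropped, each
     * host contributes at most maxPerHost URLs to a segment, and a segment
     * holds at most topN URLs. Segments are created only when needed, so
     * fewer than maxNumSegments may be returned.
     */
    static List<List<Entry>> generate(List<Entry> sortedByScore, float minScore,
                                      int maxPerHost, int topN,
                                      int maxNumSegments) {
        List<List<Entry>> segments = new ArrayList<>();
        List<Map<String, Integer>> hostCounts = new ArrayList<>();
        for (Entry e : sortedByScore) {
            if (e.score < minScore) {
                continue; // filtered out early, before any further work
            }
            // Place the entry in the first segment that still has room.
            for (int s = 0; s < maxNumSegments; s++) {
                if (s == segments.size()) {
                    segments.add(new ArrayList<>());
                    hostCounts.add(new HashMap<>());
                }
                List<Entry> segment = segments.get(s);
                Map<String, Integer> counts = hostCounts.get(s);
                int seen = counts.getOrDefault(e.host, 0);
                if (segment.size() < topN && seen < maxPerHost) {
                    segment.add(e);
                    counts.put(e.host, seen + 1);
                    break;
                }
            }
        }
        return segments;
    }
}
```

Entries that fit in no segment are simply skipped here; they would remain candidates for a later generate run.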