[jira] [Created] (NUTCH-1628) Chocolatey package for Windows users
Andrew Pennebaker created NUTCH-1628:

Summary: Chocolatey package for Windows users
Key: NUTCH-1628
URL: https://issues.apache.org/jira/browse/NUTCH-1628
Project: Nutch
Issue Type: Improvement
Components: build
Environment: Chocolatey (http://chocolatey.org/) Windows XP+
Reporter: Andrew Pennebaker
Priority: Minor

Setting up developer tools in Windows can be a trial. If we provided a Chocolatey package for nutch, it could bring more Windows users into the fold, encouraging them to use nutch as a dependency in larger software systems.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1627) Debian package for installing nutch
Andrew Pennebaker created NUTCH-1627:

Summary: Debian package for installing nutch
Key: NUTCH-1627
URL: https://issues.apache.org/jira/browse/NUTCH-1627
Project: Nutch
Issue Type: Improvement
Components: build
Environment: Ubuntu 12.04 Precise
Reporter: Andrew Pennebaker
Priority: Minor

The simpler it is to install nutch, the easier it is to start using it. Could we please create a build task for generating a .deb installer for Debian/Ubuntu? Eventually, it would be great to have a PPA, and then an official package in the Ubuntu apt repo.
[jira] [Created] (NUTCH-1626) Homebrew formula for installing Nutch in Mac OS X
Andrew Pennebaker created NUTCH-1626:

Summary: Homebrew formula for installing Nutch in Mac OS X
Key: NUTCH-1626
URL: https://issues.apache.org/jira/browse/NUTCH-1626
Project: Nutch
Issue Type: Improvement
Components: build
Environment: Homebrew (http://brew.sh/) Mac OS X 10.5+
Reporter: Andrew Pennebaker
Priority: Minor

Manually installing nutch takes time and effort out of a developer's day. It would be a great convenience to have an install formula for Homebrew for Mac users!

I have begun working on such a formula: https://github.com/mxcl/homebrew/pull/22004

After `brew install nutch`, you can run `nutch`, but the associated tools like `nutch junit` aren't working for some reason.
nofollow behaviour [#NUTCH-693]
Hi,

I've experimented with Nutch for crawling Tor hidden services, and I still run into an annoying issue that requires a patched Nutch version: #NUTCH-693 [1].

This issue is a request for an option to control how Nutch behaves when it encounters a rel="nofollow" link. Currently, Nutch always ignores such links, and there is no way to configure this behaviour without patching it.

The issue was closed with little discussion, on the claim that such an option would be the same as a hypothetical "ignore.robotstxt" option. This is not the case. robots.txt is the way for webmasters to prevent crawlers from accessing certain URLs. This is *not* the job of nofollow.

robots.txt is always controlled by the webmaster and, as such, it makes sense to strictly honour it. On the other hand, nofollow is always controlled by third parties (otherwise, robots.txt should be used), and its well-established use is to indicate non-endorsement of a URL. In practice, that means not giving link-juice to potential spammers. nofollow is not meant to be an access control mechanism. nofollow is not meant to protect websites from crawler abuse either. That is robots.txt's job. So there is no point in treating them as the same.

Now, there are very real use cases for following links with the rel="nofollow" attribute. In a loosely connected portion of the web, following these links might be the only sane way to crawl successfully.

The Tor deepweb is a very clear case. There is a site that is very central in the Tor link-graph: The Hidden Wiki. It is a great seed for crawling Tor. But it is MediaWiki-based, which means that every external link is tagged as rel="nofollow". Finding enough good seed URLs to crawl Tor without going through rel="nofollow" links is not trivial at all. The same might happen when crawling corporate intranets, I2P or other networks.

So there is a clear use case for adding an option to follow rel="nofollow" links. And, as far as I know, there is no reason not to add it.

That is why I would like this to be discussed and, if deemed sensible, #NUTCH-693 reopened.

[1] https://issues.apache.org/jira/browse/NUTCH-693

Best,
--
Santiago M. Mola
Jabber ID: cooldw...@gmail.com
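
To make the requested behaviour concrete, here is a minimal sketch of a link extractor with a switch for rel="nofollow" links. It is written in Python purely for illustration and is not Nutch code; the names OutlinkExtractor, extract_outlinks and follow_nofollow are hypothetical, and the example URLs are placeholders.

```python
from html.parser import HTMLParser


class OutlinkExtractor(HTMLParser):
    """Collects href targets from <a> tags.

    follow_nofollow is a hypothetical switch used for illustration;
    it is not an existing Nutch configuration property.
    """

    def __init__(self, follow_nofollow=False):
        super().__init__()
        self.follow_nofollow = follow_nofollow
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel holds space-separated tokens, e.g. rel="external nofollow".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens and not self.follow_nofollow:
            # Behaviour described in the message: the link is dropped.
            return
        self.outlinks.append(href)


def extract_outlinks(html, follow_nofollow=False):
    """Return the list of outlinks found in an HTML string."""
    parser = OutlinkExtractor(follow_nofollow=follow_nofollow)
    parser.feed(html)
    parser.close()
    return parser.outlinks
```

With the switch off, a page whose external links are all tagged rel="nofollow" yields no outlinks at all, which is the seed problem described for The Hidden Wiki. With it on, the same page yields every link, while robots.txt handling would stay a separate concern.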
[jira] [Commented] (NUTCH-1623) Implement file.content.ignored function
[ https://issues.apache.org/jira/browse/NUTCH-1623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743856#comment-13743856 ]

Osy commented on NUTCH-1623:

Sure Lewis,

For Nutch 2.2.1, nutch-default.xml contains a description of this functionality (!! NOT IMPLEMENTED YET !!):

  If true, no file content will be saved during fetch. And it is probably what we want to set most of time, since file:// URLs are meant to be local and we can always use them directly at parsing and indexing stages. Otherwise file contents will be saved.

Exactly what I need.

Thanks

> Implement file.content.ignored function
> ---------------------------------------
>
> Key: NUTCH-1623
> URL: https://issues.apache.org/jira/browse/NUTCH-1623
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb, fetcher
> Affects Versions: 2.2, 2.2.1
> Reporter: Osy
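
The behaviour in the quoted property description could be sketched roughly as follows. This is an illustration in Python, not Nutch's fetcher code; the function name content_to_store and its parameters are assumptions made for the example.

```python
from urllib.parse import urlsplit


def content_to_store(url, content, file_content_ignored):
    """Return the bytes a fetcher would save for this URL.

    Hypothetical helper: when file_content_ignored is true, content
    fetched from file: URLs is dropped; everything else is kept.
    """
    if file_content_ignored and urlsplit(url).scheme == "file":
        return b""
    return content
```

Only file: URLs are affected in this sketch; content fetched over other schemes is stored regardless of the flag.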
Re: crawl.gen.delay
Yes, it is used in Nutch 1.x, but it is never used in Nutch 2.x, because Nutch 2.x will never generate a URL that has already been selected.

The unit of crawl.gen.delay is milliseconds. You can check the Nutch 1.x nutch-default.xml, where the property is described like this:

  crawl.gen.delay
  60480
  This value, expressed in milliseconds, defines how long we should keep the lock on records in CrawlDb that were just selected for fetching. If these records are not updated in the meantime, the lock is canceled, i.e. they become eligible for selecting. Default value of this is 7 days (60480 ms).

Maybe it is wrong.

On Fri, Aug 16, 2013 at 3:17 AM, kaveh minooie wrote:
> crawl.gen.delay

--
Don't Grow Old, Grow Up... :-)
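
The doubt about the quoted figure can be checked with plain arithmetic. The short Python sketch below only converts units; it does not read any Nutch configuration, and the helper names are made up for the example.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000  # hours * minutes * seconds * milliseconds


def days_to_ms(days):
    """Convert a number of days to milliseconds."""
    return days * MS_PER_DAY


def ms_to_seconds(ms):
    """Convert milliseconds to seconds."""
    return ms / 1000


seven_days_ms = days_to_ms(7)
quoted_ms = 60480  # the figure as it appears in the message above

print(seven_days_ms)             # 604800000
print(ms_to_seconds(quoted_ms))  # 60.48
```

Seven days is 604800000 ms, whereas 60480 ms is only about a minute, so the figure as quoted here cannot mean both "7 days" and "milliseconds". It may simply have lost digits somewhere along the way.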
[jira] [Commented] (NUTCH-1619) Writes Dmoz Description and Title information to db with snippet argument
[ https://issues.apache.org/jira/browse/NUTCH-1619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13743621#comment-13743621 ]

lufeng commented on NUTCH-1619:

Hi Yasin,

Did you forget to close the data store? good.

> Writes Dmoz Description and Title information to db with snippet argument
> -------------------------------------------------------------------------
>
> Key: NUTCH-1619
> URL: https://issues.apache.org/jira/browse/NUTCH-1619
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 2.1
> Reporter: Yasin Kılınç
> Priority: Minor
> Fix For: 2.3
>
> Attachments: NUTCH-DMOZ-Snippet.patch
>
>
> We need the Dmoz information of fetched URLs to be written to the database,
> so that this information can be used like a snippet by the indexer of the
> search engine we are working on.