[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171503#comment-15171503 ] ASF GitHub Bot commented on NUTCH-2144: --- Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/93 > Plugin to override db.ignore.external to exempt interesting external domain > URLs > > > Key: NUTCH-2144 > URL: https://issues.apache.org/jira/browse/NUTCH-2144 > Project: Nutch > Issue Type: New Feature > Components: crawldb, fetcher >Reporter: Thamme Gowda N >Assignee: Chris A. Mattmann >Priority: Minor > Fix For: 1.12 > > Attachments: ignore-exempt.patch, ignore-exempt.patch > > > Create a rule based urlfilter plugin that allows focused crawler > (db.ignore.external.links=true) to fetch static resources from external > domains. > The generalized version of this: This plugin should permit interesting URLs > from external domains (by overriding db.ignore.external). The interesting > urls are decided from a combination of regex and mime-type rules. > Concrete use case: > When using Nutch to crawl images from a set of domains, the crawler needs > to fetch all images which may be linked from CDNs and other domains. In this > scenario, allowing all external links and then writing hundreds of regular > expressions is not feasible for large number of domains. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann resolved NUTCH-2144.
--
Resolution: Fixed

OK all fixed thanks [~thammegowda]!

{noformat}
[chipotle:~/tmp/nutch1.12] mattmann% git push -u origin master
Counting objects: 224, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (40/40), done.
Writing objects: 100% (51/51), 10.10 KiB | 0 bytes/s, done.
Total 51 (delta 25), reused 0 (delta 0)
To https://git-wip-us.apache.org/repos/asf/nutch.git
   f5e430e..15c583e  master -> master
Branch master set up to track remote branch master from origin.
[chipotle:~/tmp/nutch1.12] mattmann%
{noformat}

> Plugin to override db.ignore.external to exempt interesting external domain URLs
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb, fetcher
> Reporter: Thamme Gowda N
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.12
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
> Create a rule based urlfilter plugin that allows focused crawler
> (db.ignore.external.links=true) to fetch static resources from external
> domains.
> The generalized version of this: This plugin should permit interesting URLs
> from external domains (by overriding db.ignore.external). The interesting
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
> When using Nutch to crawl images from a set of domains, the crawler needs
> to fetch all images which may be linked from CDNs and other domains. In this
> scenario, allowing all external links and then writing hundreds of regular
> expressions is not feasible for large number of domains.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: NUTCH-2144 Added an extension point and a plug...
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/93 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171487#comment-15171487 ]

Lewis John McGibbney commented on NUTCH-:
-
Hi, I can replicate this on hbase-0.98.8-hadoop2, Hadoop 2.5.2 and the 2.X branch. I am going to try to write a unit test for this tomorrow. I'll post my unit test once I can.

> re-fetch deletes all metadata except _csh_ and _rs_
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and hbase-0.98.8-hadoop2
> Reporter: Adnane B.
> Assignee: Lewis John McGibbney
> Fix For: 2.3.2
>
> This problem happens the second time I crawl a page.
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> {code}
> Second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> {code}
> I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce easily, please add to nutch-site.xml:
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)</description>
> </property>
> {code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
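When scripting the reproduction above, it can help to generate the nutch-site.xml override rather than paste it by hand. The following is only a rough sketch, assuming the usual Hadoop-style `<property>` layout with `name`, `value`, and `description` children; the helper name `make_property` is hypothetical and not part of Nutch.

```python
import xml.etree.ElementTree as ET


def make_property(name: str, value: str, description: str = "") -> str:
    """Build one Hadoop-style <property> element and return it as text.

    Hypothetical helper for illustration; Nutch itself ships no such function.
    """
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    if description:
        ET.SubElement(prop, "description").text = description
    ET.indent(prop)  # pretty-print in place; requires Python 3.9 or newer
    return ET.tostring(prop, encoding="unicode")


if __name__ == "__main__":
    # The 60-second interval from the report makes pages due for re-fetch quickly.
    print(make_property(
        "db.fetch.interval.default",
        "60",
        "The default number of seconds between re-fetches of a page (1 minute)",
    ))
```

The printed element still has to be placed inside the `<configuration>` root of nutch-site.xml; the sketch does not edit the file itself.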
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171369#comment-15171369 ] ASF GitHub Bot commented on NUTCH-2144: --- Github user thammegowda closed the pull request at: https://github.com/apache/nutch/pull/89 > Plugin to override db.ignore.external to exempt interesting external domain > URLs > > > Key: NUTCH-2144 > URL: https://issues.apache.org/jira/browse/NUTCH-2144 > Project: Nutch > Issue Type: New Feature > Components: crawldb, fetcher >Reporter: Thamme Gowda N >Assignee: Chris A. Mattmann >Priority: Minor > Fix For: 1.12 > > Attachments: ignore-exempt.patch, ignore-exempt.patch > > > Create a rule based urlfilter plugin that allows focused crawler > (db.ignore.external.links=true) to fetch static resources from external > domains. > The generalized version of this: This plugin should permit interesting URLs > from external domains (by overriding db.ignore.external). The interesting > urls are decided from a combination of regex and mime-type rules. > Concrete use case: > When using Nutch to crawl images from a set of domains, the crawler needs > to fetch all images which may be linked from CDNs and other domains. In this > scenario, allowing all external links and then writing hundreds of regular > expressions is not feasible for large number of domains. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: NUTCH-2144 : override db.ignore.external to ex...
Github user thammegowda closed the pull request at: https://github.com/apache/nutch/pull/89
[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171366#comment-15171366 ]

ASF GitHub Bot commented on NUTCH-2144:
---
GitHub user thammegowda opened a pull request:

    https://github.com/apache/nutch/pull/93

    NUTCH-2144 Added an extension point and a plugin to accept external links

    This PR is a duplicate of #89.
    Recreated due to the issues caused while moving to writable git.
    @chrismattmann

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thammegowda/nutch NUTCH-2144

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/93.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #93

commit 2015703cfd32cae98b14d2fd6af5ac4396237c48
Author: Thamme Gowda
Date: 2016-02-29T03:23:26Z

    NUTCH-2144 Added an extension point and a plugin that overrides db.ignore.external to accept external links

commit 9a284c0d6d2aec86b00016a8abeddc07e5292ee9
Author: Thamme Gowda
Date: 2016-02-29T03:29:09Z

    Add a sample config

> Plugin to override db.ignore.external to exempt interesting external domain URLs
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
> Issue Type: New Feature
> Components: crawldb, fetcher
> Reporter: Thamme Gowda N
> Assignee: Chris A. Mattmann
> Priority: Minor
> Fix For: 1.12
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
> Create a rule based urlfilter plugin that allows focused crawler
> (db.ignore.external.links=true) to fetch static resources from external
> domains.
> The generalized version of this: This plugin should permit interesting URLs
> from external domains (by overriding db.ignore.external). The interesting
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
> When using Nutch to crawl images from a set of domains, the crawler needs
> to fetch all images which may be linked from CDNs and other domains. In this
> scenario, allowing all external links and then writing hundreds of regular
> expressions is not feasible for large number of domains.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
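To make the idea in the issue description concrete, here is a minimal sketch of how a regex rule combined with a mime-type rule could exempt an external link while db.ignore.external.links stays enabled. It is written in Python for brevity and is not the plugin's code: the rule format, the function names, and the choice to guess the MIME type from the URL path (rather than from a fetched response) are all assumptions made for illustration.

```python
import mimetypes
import re
from urllib.parse import urlparse

# Hypothetical rule set: each rule pairs a MIME-type pattern with a URL-path
# regex. This format is an assumption, not the plugin's configuration syntax.
EXEMPTION_RULES = [
    (re.compile(r"^image/"), re.compile(r"\.(?:jpe?g|png|gif)$", re.IGNORECASE)),
]


def is_external(from_url: str, to_url: str) -> bool:
    """Return True when the outlink's host differs from the source page's host."""
    return urlparse(from_url).hostname != urlparse(to_url).hostname


def is_exempt(to_url: str, rules=EXEMPTION_RULES) -> bool:
    """Return True when the URL satisfies both halves of at least one rule."""
    path = urlparse(to_url).path
    # Guess the MIME type from the file extension only; nothing is fetched.
    mime, _ = mimetypes.guess_type(path)
    if mime is None:
        return False
    return any(m.search(mime) and u.search(path) for m, u in rules)


def should_follow(from_url: str, to_url: str, ignore_external: bool = True) -> bool:
    """Decide whether a focused crawl keeps this outlink."""
    if not ignore_external or not is_external(from_url, to_url):
        return True  # internal links, or an unrestricted crawl, always pass
    return is_exempt(to_url)  # external links pass only when a rule exempts them


if __name__ == "__main__":
    page = "http://example.com/gallery.html"
    print(should_follow(page, "http://cdn.example.net/img/logo.png"))
    print(should_follow(page, "http://other.example.org/page.html"))
```

With one rule per resource type, the image-crawling use case needs a handful of patterns instead of a regular expression per external domain.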
[GitHub] nutch pull request: NUTCH-2144 Added an extension point and a plug...
GitHub user thammegowda opened a pull request:

    https://github.com/apache/nutch/pull/93

    NUTCH-2144 Added an extension point and a plugin to accept external links

    This PR is a duplicate of #89.
    Recreated due to the issues caused while moving to writable git.
    @chrismattmann

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/thammegowda/nutch NUTCH-2144

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/93.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #93

commit 2015703cfd32cae98b14d2fd6af5ac4396237c48
Author: Thamme Gowda
Date: 2016-02-29T03:23:26Z

    NUTCH-2144 Added an extension point and a plugin that overrides db.ignore.external to accept external links

commit 9a284c0d6d2aec86b00016a8abeddc07e5292ee9
Author: Thamme Gowda
Date: 2016-02-29T03:29:09Z

    Add a sample config
[Nutch Wiki] Trivial Update of "Nutch2Tutorial" by LewisJohnMcgibbney
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "Nutch2Tutorial" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/Nutch2Tutorial?action=diff=16=17

  == Obtaining Software and Configuration ==
   * Grab the latest distribution of Nutch 2.X from [[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. '''Do NOT build the source yet'''. From now on we will refer to the directory where the Nutch code resides as $NUTCH_HOME.
-  * Download and configure HBase 0.98.8-hadoop. You can get it [[http://archive.apache.org/dist/hbase/|here]] ('''N.B.''' Each version of Gora is tied to a particular version of HBase, we therefore suggest you use this version if possible. If you decide to use another version of HBase please do not be surprised if the stack does not work. You should also obtain [[http://hbase.apache.org/book.html#quickstart|current documentation for HBase]] however please again take into consideration that the version of HBase we recommend you use may not correlate to the current documentation. Please keep this in mind and use your initiative.
+  * Download and configure HBase 0.98.8-hadoop2. You can get it [[http://archive.apache.org/dist/hbase/hbase-0.98.8/hbase-0.98.8-hadoop2-bin.tar.gz|here]] ('''N.B.''' Each version of Gora is tied to a particular version of HBase, we therefore suggest you use this version if possible. If you decide to use another version of HBase please do not be surprised if the stack does not work. You should also obtain [[http://hbase.apache.org/book.html#quickstart|current documentation for HBase]] however please again take into consideration that the version of HBase we recommend you use may not correlate to the current documentation. Please keep this in mind and use your initiative.
* Specify the GORA backend in $NUTCH_HOME/conf/nutch-site.xml along with all of the other Configuration options suggested within the [[http://wiki.apache.org/nutch/NutchTutorial|Nutch 1.x tutorial]]. {{{
[jira] [Updated] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_
[ https://issues.apache.org/jira/browse/NUTCH-?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-:

    Fix Version/s: 2.3.2

> re-fetch deletes all metadata except _csh_ and _rs_
>
> Key: NUTCH-
> URL: https://issues.apache.org/jira/browse/NUTCH-
> Project: Nutch
> Issue Type: Bug
> Components: crawldb
> Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and hbase-0.98.8-hadoop2
> Reporter: Adnane B.
> Assignee: Lewis John McGibbney
> Fix For: 2.3.2
>
> This problem happens the second time I crawl a page.
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> {code}
> Second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch -all --> *** metadata is deleted for all pages already crawled ***
> bin/nutch parse -force -all
> bin/nutch updatedb -all
> {code}
> I reproduced it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce easily, please add to nutch-site.xml:
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)</description>
> </property>
> {code}

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1741) Support of Sitemaps in Nutch 2.x
[ https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1741: Fix Version/s: (was: 2.4) 2.3.2 > Support of Sitemaps in Nutch 2.x > > > Key: NUTCH-1741 > URL: https://issues.apache.org/jira/browse/NUTCH-1741 > Project: Nutch > Issue Type: New Feature > Components: fetcher, generator >Reporter: Alparslan Avcı >Assignee: cihad güzel > Labels: gsoc2015 > Fix For: 2.3.2 > > Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch, > NUTCH-1741-v4.patch, NUTCH-1741.patch, NUTCH-1741v5.patch, > NUTCH-1741v6.patch, NUTCH-1741v7.patch, SitemapCrawlerLifeCycle.pdf, > SitemapDevelopmentFor2x.pdf > > > Sitemap support has to be implemented for 2.x branch. It is being discussed > in NUTCH-1465 for trunk. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[Nutch Wiki] Trivial Update of "UsingGit" by LewisJohnMcgibbney
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "UsingGit" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/UsingGit?action=diff=2=3

  Apache Nutch uses the [[http://git-scm.com/|Git]] version control system. Apache provides writeable Git repositories hosted at [[https://git-wip-us.apache.org|https://git-wip-us.apache.org/]]. This guide assumes you have read the guides and information provided on the Apache Git-WP page.
- = Migrating from an existing SVN checkout of Nutch to Git =
+ = Migrating from an existing SVN checkout of Nutch (trunk) to Git =
  If you need to migrate from an SVN checkout of Nutch to Git, follow these instructions below.
[jira] [Commented] (NUTCH-2234) Upgrade to elasticsearch 2.1.1
[ https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171264#comment-15171264 ]

Tien Nguyen Manh commented on NUTCH-2234:
-
Elasticsearch 2.1.1 uses httpclient 4.3.6.

> Upgrade to elasticsearch 2.1.1
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 1.11
> Reporter: Tien Nguyen Manh
> Assignee: Markus Jelsma
> Fix For: 1.12
> Attachments: NUTCH-2234.patch
>
> Currently we use elasticsearch 1.x. We should upgrade to 2.x.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2236) Upgrade to Hadoop 2.7.1
[ https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tien Nguyen Manh updated NUTCH-2236:

    Attachment: NUTCH-2236.patch

I ran Nutch 1.11 on Hadoop 2.7.1 with this patch. We also need to add this line to etc/hadoop/mapred-env.sh:

export HADOOP_USER_CLASSPATH_FIRST=true

> Upgrade to Hadoop 2.7.1
>
> Key: NUTCH-2236
> URL: https://issues.apache.org/jira/browse/NUTCH-2236
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.11
> Reporter: Tien Nguyen Manh
> Attachments: NUTCH-2236.patch
>
> Upgrade to Hadoop 2.7.1

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
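For anyone applying the mapred-env.sh change across several nodes, the edit can be scripted so that repeated runs do not duplicate the line. This is a small sketch only; the function name `ensure_export` is hypothetical, and the path in the usage example assumes the script runs from the Hadoop installation directory.

```python
from pathlib import Path

# The line quoted in the comment above.
EXPORT_LINE = "export HADOOP_USER_CLASSPATH_FIRST=true"


def ensure_export(env_file: Path, line: str = EXPORT_LINE) -> bool:
    """Append `line` to `env_file` unless it is already there.

    Returns True when the file was modified, False when nothing changed.
    """
    text = env_file.read_text() if env_file.exists() else ""
    # Compare whole lines so a commented-out copy does not count as present.
    if line in (existing.strip() for existing in text.splitlines()):
        return False
    # Start on a fresh line when the file lacks a trailing newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    env_file.write_text(text + prefix + line + "\n")
    return True


if __name__ == "__main__":
    # Assumed location, relative to the Hadoop installation directory.
    changed = ensure_export(Path("etc/hadoop/mapred-env.sh"))
    print("updated" if changed else "already present")
```

Running it a second time leaves the file untouched, which keeps the change safe to include in a provisioning script.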
[jira] [Created] (NUTCH-2236) Upgrade to Hadoop 2.7.1
Tien Nguyen Manh created NUTCH-2236: --- Summary: Upgrade to Hadoop 2.7.1 Key: NUTCH-2236 URL: https://issues.apache.org/jira/browse/NUTCH-2236 Project: Nutch Issue Type: Improvement Affects Versions: 1.11 Reporter: Tien Nguyen Manh Upgrade to Hadoop 2.7.1 -- This message was sent by Atlassian JIRA (v6.3.4#6332)