[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171503#comment-15171503
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/93


> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: the plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> URLs are selected by a combination of regex and MIME-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images, which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for a large number of domains.
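
The exemption idea described above can be illustrated with a minimal shell sketch. It is not the plugin's implementation or its rule syntax: the function names `host_of` and `keep_outlink` and the image-extension pattern are hypothetical, and only the regex half of the rules is shown (MIME-type rules are left out).

```shell
#!/bin/sh
# Hypothetical exemption rule (an assumption of this sketch): external URLs
# ending in a common image extension are exempt from the external-link ban.
EXEMPT_URL_REGEX='^https?://[^/]+/.*\.(png|jpe?g|gif)(\?.*)?$'

# host_of URL -> prints the host part of a URL such as http://host/path
host_of() {
  printf '%s\n' "$1" | sed -E 's#^[a-zA-Z]+://([^/:]+).*#\1#'
}

# keep_outlink FROM_URL TO_URL -> exit status 0 if the outlink is kept
keep_outlink() {
  from_host=$(host_of "$1")
  to_host=$(host_of "$2")
  # Links that stay on the same host are always kept.
  [ "$from_host" = "$to_host" ] && return 0
  # External links are kept only when they match the exemption rule.
  printf '%s\n' "$2" | grep -Eiq "$EXEMPT_URL_REGEX"
}
```

With this sketch, `keep_outlink http://example.com/a.html http://cdn.example.net/img/logo.png` succeeds, while the same call with an external HTML page as the target fails. A single exemption rule stands in for the per-domain allow-list that the use case calls infeasible.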



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-28 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2144.
--
Resolution: Fixed

OK, all fixed. Thanks, [~thammegowda]!

{noformat}
[chipotle:~/tmp/nutch1.12] mattmann% git push -u origin master
Counting objects: 224, done.
Delta compression using up to 4 threads.
Compressing objects: 100% (40/40), done.
Writing objects: 100% (51/51), 10.10 KiB | 0 bytes/s, done.
Total 51 (delta 25), reused 0 (delta 0)
To https://git-wip-us.apache.org/repos/asf/nutch.git
   f5e430e..15c583e  master -> master
Branch master set up to track remote branch master from origin.
[chipotle:~/tmp/nutch1.12] mattmann% 
{noformat}




[GitHub] nutch pull request: NUTCH-2144 Added an extension point and a plug...

2016-02-28 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/93


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-02-28 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171487#comment-15171487
 ] 

Lewis John McGibbney commented on NUTCH-2222:
-

Hi, I can replicate this on hbase-0.98.8-hadoop2, Hadoop 2.5.2 and the 2.X branch.
I am going to try to write a unit test for this tomorrow and will post it 
once I can.

> re-fetch deletes all metadata except _csh_ and _rs_
> 
>
> Key: NUTCH-2222
> URL: https://issues.apache.org/jira/browse/NUTCH-2222
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.3.1
> Environment: Centos 6, mongodb 2.6 and mongodb 3.0 and 
> hbase-0.98.8-hadoop2
>Reporter: Adnane B.
>Assignee: Lewis John McGibbney
> Fix For: 2.3.2
>
>
> This problem happens the second time I crawl a page:
> {code}
> bin/nutch inject urls/
> bin/nutch generate -topN 1000
> bin/nutch fetch  -all
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> Second time (re-fetch):
> {code}
> bin/nutch generate -topN 1000 --> batchid changes for all existing pages
> bin/nutch fetch  -all   -->  *** metadata is deleted for all pages already 
> crawled ***
> bin/nutch parse -force   -all
> bin/nutch updatedb  -all
> {code}
> I can reproduce it with mongodb 2.6, mongodb 3.0, and hbase-0.98.8-hadoop2.
> It happens only if the page has not changed.
> To reproduce it easily, add the following to nutch-site.xml:
> {code}
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>60</value>
>   <description>The default number of seconds between re-fetches of a page (1 minute)</description>
> </property>
> {code}
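
The reproduction steps above could be wrapped in a small script, sketched here under a few assumptions: it is run from the Nutch 2.x runtime directory, the `NUTCH` and `WAIT_SECONDS` variables are names invented for this sketch, and the pause between rounds (just over the 60-second fetch interval configured above) is an assumption about what makes the pages due for re-fetch.

```shell
#!/bin/sh
# Command used to invoke Nutch; override it (for example NUTCH=echo) for a
# dry run. The variable name is an assumption of this sketch.
NUTCH=${NUTCH:-bin/nutch}

# One generate/fetch/parse/updatedb cycle, with the flags from the report.
crawl_round() {
  $NUTCH generate -topN 1000 &&
  $NUTCH fetch -all &&
  $NUTCH parse -force -all &&
  $NUTCH updatedb -all
}

# Full reproduction: inject the seeds, crawl once, wait, then crawl again.
reproduce() {
  $NUTCH inject urls/ &&
  crawl_round &&
  sleep "${WAIT_SECONDS:-61}" &&
  crawl_round
}
```

Calling `reproduce` runs both rounds. According to the report, the metadata loss should be visible after the `fetch` step of the second round, provided the page content has not changed.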





[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171369#comment-15171369
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

Github user thammegowda closed the pull request at:

https://github.com/apache/nutch/pull/89







[GitHub] nutch pull request: NUTCH-2144 : override db.ignore.external to ex...

2016-02-28 Thread thammegowda
Github user thammegowda closed the pull request at:

https://github.com/apache/nutch/pull/89




[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171366#comment-15171366
 ] 

ASF GitHub Bot commented on NUTCH-2144:
---

GitHub user thammegowda opened a pull request:

https://github.com/apache/nutch/pull/93

NUTCH-2144 Added an extension point and a plugin to accept external links

This PR is a duplicate of #89, recreated because of issues caused by the
move to writable git.


@chrismattmann 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/nutch NUTCH-2144

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/93.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #93


commit 2015703cfd32cae98b14d2fd6af5ac4396237c48
Author: Thamme Gowda 
Date:   2016-02-29T03:23:26Z

NUTCH-2144 Added an extension point and a plugin that overrides 
db.ignore.external to accept external links

commit 9a284c0d6d2aec86b00016a8abeddc07e5292ee9
Author: Thamme Gowda 
Date:   2016-02-29T03:29:09Z

Add a sample config









[GitHub] nutch pull request: NUTCH-2144 Added an extension point and a plug...

2016-02-28 Thread thammegowda
GitHub user thammegowda opened a pull request:

https://github.com/apache/nutch/pull/93

NUTCH-2144 Added an extension point and a plugin to accept external links

This PR is a duplicate of #89, recreated because of issues caused by the
move to writable git.


@chrismattmann 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/thammegowda/nutch NUTCH-2144

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/93.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #93


commit 2015703cfd32cae98b14d2fd6af5ac4396237c48
Author: Thamme Gowda 
Date:   2016-02-29T03:23:26Z

NUTCH-2144 Added an extension point and a plugin that overrides 
db.ignore.external to accept external links

commit 9a284c0d6d2aec86b00016a8abeddc07e5292ee9
Author: Thamme Gowda 
Date:   2016-02-29T03:29:09Z

Add a sample config






[Nutch Wiki] Trivial Update of "Nutch2Tutorial" by LewisJohnMcgibbney

2016-02-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "Nutch2Tutorial" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/Nutch2Tutorial?action=diff=16=17

  == Obtaining Software and Configuration ==
  
   * Grab the latest distribution of Nutch 2.X from 
[[http://www.apache.org/dyn/closer.cgi/nutch/|here]]. '''Do NOT build the 
source yet'''. From now on we will refer to the directory where the Nutch code 
resides as $NUTCH_HOME.
-  * Download and configure HBase 0.98.8-hadoop. You can get it 
[[http://archive.apache.org/dist/hbase/|here]] ('''N.B.''' Each version of Gora 
is tied to a particular version of HBase, we therefore suggest you use this 
version if possible. If you decide to use another version of HBase please do 
not be surprised if the stack does not work. You should also obtain 
[[http://hbase.apache.org/book.html#quickstart|current documentation for 
HBase]] however please again take into consideration that the version of HBase 
we recommend you use may not correlate to the current documentation. Please 
keep this in mind and use your initiative.
+  * Download and configure HBase 0.98.8-hadoop2. You can get it 
[[http://archive.apache.org/dist/hbase/hbase-0.98.8/hbase-0.98.8-hadoop2-bin.tar.gz|here]]
 ('''N.B.''' Each version of Gora is tied to a particular version of HBase, we 
therefore suggest you use this version if possible. If you decide to use 
another version of HBase please do not be surprised if the stack does not work. 
You should also obtain [[http://hbase.apache.org/book.html#quickstart|current 
documentation for HBase]] however please again take into consideration that the 
version of HBase we recommend you use may not correlate to the current 
documentation. Please keep this in mind and use your initiative.
   * Specify the GORA backend in $NUTCH_HOME/conf/nutch-site.xml along with all 
of the other Configuration options suggested within the 
[[http://wiki.apache.org/nutch/NutchTutorial|Nutch 1.x tutorial]].
  
  {{{


[jira] [Updated] (NUTCH-2222) re-fetch deletes all metadata except _csh_ and _rs_

2016-02-28 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2222:

Fix Version/s: 2.3.2






[jira] [Updated] (NUTCH-1741) Support of Sitemaps in Nutch 2.x

2016-02-28 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1741?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1741:

Fix Version/s: (was: 2.4)
   2.3.2

> Support of Sitemaps in Nutch 2.x
> 
>
> Key: NUTCH-1741
> URL: https://issues.apache.org/jira/browse/NUTCH-1741
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator
>Reporter: Alparslan Avcı
>Assignee: cihad güzel
>  Labels: gsoc2015
> Fix For: 2.3.2
>
> Attachments: NUTCH-1741-v2.patch, NUTCH-1741-v3.patch, 
> NUTCH-1741-v4.patch, NUTCH-1741.patch, NUTCH-1741v5.patch, 
> NUTCH-1741v6.patch, NUTCH-1741v7.patch, SitemapCrawlerLifeCycle.pdf, 
> SitemapDevelopmentFor2x.pdf
>
>
> Sitemap support has to be implemented for the 2.x branch. It is being 
> discussed in NUTCH-1465 for trunk. 





[Nutch Wiki] Trivial Update of "UsingGit" by LewisJohnMcgibbney

2016-02-28 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "UsingGit" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/UsingGit?action=diff=2=3

  Apache Nutch uses the [[http://git-scm.com/|Git]] version control system. 
Apache provides writeable Git repositories hosted at 
[[https://git-wip-us.apache.org|https://git-wip-us.apache.org/]]. This guide 
assumes you have read the guides and information provided on the Apache Git-WP 
page.
  
- = Migrating from an existing SVN checkout of Nutch to Git =
+ = Migrating from an existing SVN checkout of Nutch (trunk) to Git =
  
  If you need to migrate from an SVN checkout of Nutch to Git, follow these 
instructions below.
  


[jira] [Commented] (NUTCH-2234) Upgrade to elasticsearch 2.1.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15171264#comment-15171264
 ] 

Tien Nguyen Manh commented on NUTCH-2234:
-

Elasticsearch 2.1.1 uses httpclient 4.3.6.

> Upgrade to elasticsearch 2.1.1
> --
>
> Key: NUTCH-2234
> URL: https://issues.apache.org/jira/browse/NUTCH-2234
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2234.patch
>
>
> Currently we use Elasticsearch 1.x; we should upgrade to 2.x.





[jira] [Updated] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tien Nguyen Manh updated NUTCH-2236:

Attachment: NUTCH-2236.patch

I ran Nutch 1.11 on Hadoop 2.7.1 with this patch.
We also need to add this line to etc/hadoop/mapred-env.sh:
export HADOOP_USER_CLASSPATH_FIRST=true
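
One way to add that line without duplicating it on repeated runs is sketched below. The helper name `ensure_line` is hypothetical, and the `HADOOP_HOME` variable in the usage comment is an assumption about where Hadoop is installed.

```shell
#!/bin/sh
# ensure_line FILE LINE -> appends LINE to FILE unless an identical line is
# already present (hypothetical helper, not part of Hadoop or Nutch).
ensure_line() {
  file=$1
  line=$2
  # -x matches whole lines only, -F treats LINE as a fixed string.
  grep -qxF "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Usage, assuming HADOOP_HOME points at the Hadoop installation:
#   ensure_line "$HADOOP_HOME/etc/hadoop/mapred-env.sh" \
#     'export HADOOP_USER_CLASSPATH_FIRST=true'
```

The setting asks Hadoop to put user-supplied jars ahead of its own on the classpath, which is presumably why the patch needs it.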


> Upgrade to Hadoop 2.7.1
> ---
>
> Key: NUTCH-2236
> URL: https://issues.apache.org/jira/browse/NUTCH-2236
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Tien Nguyen Manh
> Attachments: NUTCH-2236.patch
>
>
> Upgrade to Hadoop 2.7.1





[jira] [Created] (NUTCH-2236) Upgrade to Hadoop 2.7.1

2016-02-28 Thread Tien Nguyen Manh (JIRA)
Tien Nguyen Manh created NUTCH-2236:
---

 Summary: Upgrade to Hadoop 2.7.1
 Key: NUTCH-2236
 URL: https://issues.apache.org/jira/browse/NUTCH-2236
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.11
Reporter: Tien Nguyen Manh


Upgrade to Hadoop 2.7.1


