[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018413#comment-15018413
 ] 

Hudson commented on NUTCH-2069:
---

SUCCESS: Integrated in Nutch-trunk #3313 (See 
[https://builds.apache.org/job/Nutch-trunk/3313/])
NUTCH-2069 (jnioche: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1715386])
* trunk/CHANGES.txt
* trunk/conf/nutch-default.xml
* trunk/src/java/org/apache/nutch/fetcher/FetcherThread.java
* trunk/src/java/org/apache/nutch/parse/ParseOutputFormat.java


> Ignore external links based on domain
> -
>
> Key: NUTCH-2069
> URL: https://issues.apache.org/jira/browse/NUTCH-2069
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher, parser
>Affects Versions: 1.10
>Reporter: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2069.patch, NUTCH-2069.v2.patch
>
>
> We currently have `db.ignore.external.links` which is a nice way of 
> restricting the crawl based on the hostname. This adds a new parameter 
> 'db.ignore.external.links.domain' to do the same based on the domain.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-20 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15018232#comment-15018232
 ] 

Julien Nioche commented on NUTCH-2069:
--

No problem. It would be good to find a way to apply the Eclipse XML formatter 
config from an Ant task. There is a way to do it with Maven, but I haven't seen 
one for Ant.



[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15015387#comment-15015387
 ] 

Lewis John McGibbney commented on NUTCH-2069:
-

+1 for the patch. Sorry about the formatting, folks. We can run the code 
formatter over trunk before we release ;)



[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013616#comment-15013616
 ] 

Markus Jelsma commented on NUTCH-2069:
--

Ah, i see it now indeed. +1 for this patch



[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013545#comment-15013545
 ] 

Julien Nioche commented on NUTCH-2069:
--

> I propose the modes be named just 'host' and 'domain', as they are 
> elsewhere.

Not really, see fetcher.queue.mode and partition.url.mode 
[https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L723]

This issue is not about fixing existing discrepancies; those should be addressed 
separately.

As for mixing bydomain and byDomain, we do that only when comparing the strings:

{code}
if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode))
{code}

Changing to "byDomain" won't make any difference, but feel free to change this 
if you feel strongly about it.
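
For illustration, the comparison above can be tried in isolation. This is a 
minimal sketch: the class and method names are hypothetical and not taken from 
the patch; only the `"bydomain".equalsIgnoreCase(...)` call comes from the 
comment itself.

```java
final class IgnoreModeCheck {

  // Hypothetical helper, not part of the patch: returns true when the
  // configured mode selects domain-based filtering, whatever its casing.
  static boolean isByDomain(String ignoreExternalLinksMode) {
    // The literal comes first, so a null configuration value yields false
    // instead of throwing a NullPointerException.
    return "bydomain".equalsIgnoreCase(ignoreExternalLinksMode);
  }

  public static void main(String[] args) {
    // Both spellings discussed in the thread, plus a non-matching value.
    for (String mode : new String[] {"bydomain", "byDomain", "byHost"}) {
      System.out.println(mode + " -> " + isByDomain(mode));
    }
  }
}
```

Because the comparison ignores case, the spelling used in the configuration 
file and the spelling of the literal in the code do not need to agree.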



[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-19 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15013513#comment-15013513
 ] 

Markus Jelsma commented on NUTCH-2069:
--

Hi - looks good. One suggestion, though: the patch mixes up bydomain and 
byDomain, so this won't work well. I propose the modes be named just 'host' 
and 'domain', as they are elsewhere.

M.



[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-11-18 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15012193#comment-15012193
 ] 

Markus Jelsma commented on NUTCH-2069:
--

Hi J - I agree with the mode! Give it a default so it never breaks older 
instances and doesn't allow excluding both. Your follow-up patch is probably 
spot on; have you got one? It can still make it into 1.11!
M.
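
A minimal sketch of the defaulting idea, under stated assumptions: the mode is 
read from a plain `java.util.Properties` object rather than Nutch's own 
configuration class, the property name is the `db.ignore.external.links.mode` 
proposed on this issue, and `byHost` is assumed as the default value (its 
exact spelling is not given in this thread). The class and method names are 
hypothetical.

```java
import java.util.Properties;

final class IgnoreModeDefault {

  // Property name as proposed on this issue.
  static final String MODE_KEY = "db.ignore.external.links.mode";

  // Assumed default: host-based filtering, i.e. the behaviour that existed
  // before a mode could be configured.
  static final String DEFAULT_MODE = "byHost";

  // Returns the configured mode, or the default when the key is absent,
  // so configurations written before the change keep their behaviour.
  static String resolveMode(Properties conf) {
    return conf.getProperty(MODE_KEY, DEFAULT_MODE);
  }

  public static void main(String[] args) {
    Properties oldConf = new Properties(); // no mode set
    Properties newConf = new Properties();
    newConf.setProperty(MODE_KEY, "byDomain");

    System.out.println(resolveMode(oldConf));
    System.out.println(resolveMode(newConf));
  }
}
```

A single mode property can hold only one value at a time, which is what rules 
out enabling both kinds of exclusion together.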



[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-30 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14647467#comment-14647467
 ] 

Julien Nioche commented on NUTCH-2069:
--

Hi [~wastl-nagel] and [~markus17]. BTW, I did not mean to be short in my 
previous message, but I was typing from my phone ;-)
I know the difficulties of enforcing the code formatting systematically, but I 
thought I might as well fix it while I was working on that part of the code. 
Feel free to remove the bits from the patch that are about the formatting only.

bq. we could define this as two properties `db.ignore.external.links` + 
`db.ignore.external.links.mode`. The latter can be "host" or "domain", similar 
to other properties (partition.url.mode, generator.count.mode, 
fetcher.queue.mode). That would be extensible and can make the code leaner.

Yes, that would be more elegant.

I am on vacation for the next few weeks as of today; I will update the code 
based on your suggestion when I am back, unless one of you beats me to it of 
course.

J.  





[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646793#comment-14646793
 ] 

Sebastian Nagel commented on NUTCH-2069:


You're right, Julien. The code in FetcherThread does not follow the style. The 
code formatting patch (NUTCH-865) is now 80 issues back in the history, Fetcher 
has been refactored in the meantime, and not all commits follow the style. 
It's often hard to resist :) and not correct the style, re-organize imports, 
etc., so that patches stay lean and easy to review. But back to the main topic: 
+1, so far. One point: 'db.ignore.external.links' and the new 
'db.ignore.external.links.domain' are mutually exclusive; "external" is defined 
either by host or by domain. This should show up in the code:
{code}
if (ignoreExternalLinks) { ... } else if (ignoreLinksOutsideDomain) { ... }
{code}
Or we could define this as two properties `db.ignore.external.links` + 
`db.ignore.external.links.mode`. The latter can be "host" or "domain", similar 
to other properties (partition.url.mode, generator.count.mode, 
fetcher.queue.mode). That would be extensible and can make the code leaner.
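
The two-property variant could look roughly like the sketch below. It is an 
illustration only, not the code from the patch: the class, enum, and method 
names are hypothetical, hosts are parsed with `java.net.URI` instead of 
Nutch's `URLUtil`, and `naiveDomain` is a deliberately simple stand-in that 
keeps the last two host labels (a real implementation would need a proper 
domain-suffix lookup, since this is wrong for hosts such as 
`blog.example.co.uk`).

```java
import java.net.URI;
import java.util.Locale;

final class ExternalLinkSketch {

  // One mode value at a time, so host and domain checks cannot both apply.
  enum Mode { HOST, DOMAIN }

  // Naive stand-in for a domain lookup: keeps the last two labels of the host.
  static String naiveDomain(String host) {
    String[] labels = host.split("\\.");
    int n = labels.length;
    return n <= 2 ? host : labels[n - 2] + "." + labels[n - 1];
  }

  // Returns true when toUrl counts as "external" relative to fromUrl
  // under the given mode.
  static boolean isExternal(String fromUrl, String toUrl, Mode mode) {
    String fromHost = URI.create(fromUrl).getHost();
    String toHost = URI.create(toUrl).getHost();
    if (fromHost == null || toHost == null) {
      return true; // in this sketch, URLs without a host count as external
    }
    fromHost = fromHost.toLowerCase(Locale.ROOT);
    toHost = toHost.toLowerCase(Locale.ROOT);
    if (mode == Mode.DOMAIN) {
      return !naiveDomain(fromHost).equals(naiveDomain(toHost));
    }
    return !fromHost.equals(toHost);
  }

  public static void main(String[] args) {
    String from = "http://www.example.com/a";
    String to = "http://blog.example.com/b";
    System.out.println("HOST:   " + isExternal(from, to, Mode.HOST));
    System.out.println("DOMAIN: " + isExternal(from, to, Mode.DOMAIN));
  }
}
```

With a single mode value, the `if ... else if` chain over two booleans 
collapses into one branch on the mode, and further modes could be added 
without introducing new boolean properties.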

Btw., good idea to add the formatter to 1.x as well, and if possible 
automatically add it to the Eclipse project created by "ant eclipse". 



[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646543#comment-14646543
 ] 

Julien Nioche commented on NUTCH-2069:
--

What code restyle? I applied the formatting rules from 2.x as expected. They 
should be copied to trunk, BTW. Looks like Lewis did not use them.



[jira] [Commented] (NUTCH-2069) Ignore external links based on domain

2015-07-29 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14646504#comment-14646504
 ] 

Markus Jelsma commented on NUTCH-2069:
--

Fine with the feature, but there's a lot of clutter in the patch. Are you not 
happy with the code restyle Lewis did? I am not sure I see a lot of use for, e.g.:
{code}
@@ -678,19 +697,18 @@
 
   // Check whether we'll follow external outlinks
   if (outlinksIgnoreExternal) {
-if (!URLUtil.getHost(url.toString()).equals(
-URLUtil.getHost(followUrl))) {
+if (!URLUtil.getHost(url.toString())
+.equals(URLUtil.getHost(followUrl))) {
   continue;
 }
   }
 
-  reporter
-  .incrCounter("FetcherOutlinks", "outlinks_following", 1);
+  reporter.incrCounter("FetcherOutlinks", "outlinks_following", 1);
 
   // Create new FetchItem with depth incremented
   FetchItem fit = FetchItem.create(new Text(followUrl),
-  new CrawlDatum(CrawlDatum.STATUS_LINKED, interval),
-  queueMode, outlinkDepth + 1);
+  new CrawlDatum(CrawlDatum.STATUS_LINKED, interval), queueMode,
+  outlinkDepth + 1);
   ((FetchItemQueues) fetchQueues).addFetchItem(fit);
 
   outlinkCounter++;
{code}

And besides, this would force me to completely rewrite some patches as well, 
which I already had to do because of the code style change ;)
