[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-15 Thread Balaji Gurumurthy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959627#comment-14959627
 ] 

Balaji Gurumurthy commented on NUTCH-2141:
--

When we concatenate the content from multiple pages and then try to load it 
back into the browser using JavascriptExecutor, more often than not we get 
exceptions ("Unterminated string literal" and "Missing ; before statement", to 
name a few) while executing the JavaScript string. Debugging these errors 
across all the pages' concatenated content is a pain.
Instead of concatenating the content, loading it back into the driver, and then 
reading it from the driver again in the HTTPResponse class, simply returning 
the concatenated result to Nutch seemed better.
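
To make the proposal concrete, here is a minimal sketch of a handler that appends the source of each page and returns the result directly. It is an illustration only: `PageSession`, `ContentHandler`, `PaginationHandler` and `fakeSession` are hypothetical names, and `PageSession` is a stand-in rather than the Selenium WebDriver API or the plugin's actual interface.

```java
import java.util.Iterator;
import java.util.List;

public class PaginationHandlerSketch {

  /** Hypothetical stand-in for a browser session; NOT the Selenium WebDriver API. */
  public interface PageSession {
    String currentPageSource();

    /** Moves to the next page; returns false when there is none. */
    boolean goToNextPage();
  }

  /** Hypothetical handler contract: return the content instead of writing it back to the driver. */
  public interface ContentHandler {
    String processDriver(PageSession session);
  }

  /** Appends the source of every page and returns the concatenated result. */
  public static class PaginationHandler implements ContentHandler {
    @Override
    public String processDriver(PageSession session) {
      StringBuilder content = new StringBuilder();
      do {
        content.append(session.currentPageSource());
      } while (session.goToNextPage());
      return content.toString();
    }
  }

  /** In-memory session over a fixed, non-empty list of pages, for illustration only. */
  public static PageSession fakeSession(final List<String> pages) {
    return new PageSession() {
      private final Iterator<String> it = pages.iterator();
      private String current = it.next();

      @Override
      public String currentPageSource() {
        return current;
      }

      @Override
      public boolean goToNextPage() {
        if (!it.hasNext()) {
          return false;
        }
        current = it.next();
        return true;
      }
    };
  }
}
```

Because the content never passes through a JavaScript string literal, the escaping errors mentioned above would not arise in this shape.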

> Change the InteractiveSelenium plugin handler Interface to return page content
> --
>
> Key: NUTCH-2141
> URL: https://issues.apache.org/jira/browse/NUTCH-2141
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Balaji Gurumurthy
>  Labels: selenium
>
> The handler interface in the protocol-interactiveselenium plugin currently 
> provides methods to manipulate the page content, and the HTTPResponse class 
> reads the page content from the driver. This limits the amount of HTML 
> content that can be returned to Nutch.
> The processDriver method could return a String object instead. This is 
> particularly helpful in cases such as handling pagination, when multiple 
> pages' content can be appended and returned from the handler.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-15 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959659#comment-14959659
 ] 

Michael Joyce commented on NUTCH-2141:
--

Cool, makes sense. Do you have any examples? I'd like to poke around as well. 
You're going to need to handle the screenshot functionality differently too, 
since getHTMLContent does more than just return the body content. We probably 
don't really need the DefalultMultiInteractionHandler example either if this 
basically replaces that. [~asitang] might have some ideas as well.






[jira] [Created] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error

2015-10-15 Thread Karanjeet Singh (JIRA)
Karanjeet Singh created NUTCH-2142:
--

 Summary: Nutch File Dump - FileNotFoundException (Invalid 
Argument) Error
 Key: NUTCH-2142
 URL: https://issues.apache.org/jira/browse/NUTCH-2142
 Project: Nutch
  Issue Type: Bug
  Components: tool, util
Affects Versions: 1.10, 1.11
 Environment: Operating System - Linux (RHEL 6.2)
Reporter: Karanjeet Singh
 Fix For: 1.11


Got *FileNotFoundException* while running nutch dump.

*Cause*: The character '?' in the file name/extension produces the error below.

*Error Details*
java.io.FileNotFoundException: /media/PATRO/Karan/nutch_12Oct/other_gun_urls/img/99/fb/97d3980f9954b597f372d092b97eff22_27tlt_recon_1_black_g_10_handle_.jpeg? (Invalid argument)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
    at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:222)
    at org.apache.nutch.tools.FileDumper.main(FileDumper.java:325)
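
One possible workaround, sketched below, is to replace characters that some filesystems reject before the output file is opened. This is not the fix adopted in FileDumper; `FileNameSanitizer` and `sanitize` are hypothetical names, and the character set is an assumption based on what FAT/NTFS-style filesystems commonly refuse.

```java
public class FileNameSanitizer {

  /**
   * Replaces characters that are commonly rejected in file names
   * (backslash, slash, colon, asterisk, question mark, double quote,
   * angle brackets, pipe) with an underscore. The set is an assumption.
   */
  public static String sanitize(String fileName) {
    return fileName.replaceAll("[\\\\/:*?\"<>|]", "_");
  }
}
```

With this sketch, a name ending in `_handle_.jpeg?` would become `_handle_.jpeg_`; whether an underscore or plain removal is preferable is a design choice left open here.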





[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959565#comment-14959565
 ] 

ASF GitHub Bot commented on NUTCH-2139:
---

Github user jorgelbg commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42176268
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger 
value than 30, when using this
   
 
 
+
+
+<property>
+  <name>outlinks.host.ignore</name>
+  <value>false</value>
+  <description>
+    Ignore the outlinks that point out to the same host as the URL being indexed.
+    By default all outlinks are indexed.
+  </description>
+</property>
+
+<property>
+  <name>inlinks.host.ignore</name>
+  <value>false</value>
+  <description>
+    Ignore the inlinks coming from the same host as the URL being indexed. By default
+    all inlinks are indexed.
+  </description>
--- End diff --

Indeed.


> Basic plugin to index inlinks and outlinks
> --
>
> Key: NUTCH-2139
> URL: https://issues.apache.org/jira/browse/NUTCH-2139
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: link, plugin
> Fix For: 1.11
>
>
> A basic plugin that indexes the inlinks and outlinks of web pages. This 
> could be very useful for analytic purposes, including neat visualizations 
> using d3.js, for instance.





[jira] [Created] (NUTCH-2143) GeneratorJob ignores batch id passed as argument

2015-10-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2143:
--

 Summary: GeneratorJob ignores batch id passed as argument
 Key: NUTCH-2143
 URL: https://issues.apache.org/jira/browse/NUTCH-2143
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 2.3.1
Reporter: Sebastian Nagel
Priority: Blocker
 Fix For: 2.3.1


The batch id passed to GeneratorJob by option/argument -batchId  is ignored 
and a generated batch id is used to mark the current batch. Log snippets from a 
run of bin/crawl:
{noformat}
bin/nutch generate ... -batchId 1444941073-14208
...
GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs

Fetching : 
bin/nutch fetch ... 1444941073-14208 ...
...
QueueFeeder finished: total 0 records. Hit by time limit :0
{noformat}

The generated URLs are marked with the wrong batch id:
{noformat}
hbase(main):010:0> scan 'test_webpage'
ROW                      COLUMN+CELL
 org.apache.nutch:http/  column=f:bid, timestamp=1444941077080, value=1444941074-858443668
 ...
 org.apache.nutch:http/  column=mk:_gnmrk_, timestamp=1444941077080, value=1444941074-858443668
{noformat}
As a result, the fetcher will not fetch anything. This problem was reported by 
Sherban Drulea 
[[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html],[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].
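
The expected behaviour can be sketched as follows: use the batch id given on the command line when there is one, and generate an id only otherwise. This is not the GeneratorJob code; `BatchIdSketch` and `resolveBatchId` are hypothetical names, and the generated format (epoch seconds, a dash, a random integer) is only inferred from the log snippets above.

```java
import java.util.Random;

public class BatchIdSketch {

  /**
   * Returns the requested batch id when one was passed; otherwise
   * generates an id shaped like the ones in the log output
   * (epoch seconds, dash, non-negative random integer).
   */
  public static String resolveBatchId(String requested, long nowMillis, Random random) {
    if (requested != null && !requested.trim().isEmpty()) {
      return requested.trim();
    }
    return (nowMillis / 1000) + "-" + random.nextInt(Integer.MAX_VALUE);
  }
}
```

With the id resolved this way, the value used to mark the generated rows would match the id that bin/crawl later hands to the fetcher.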





[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread jorgelbg
Github user jorgelbg commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42176268
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger 
value than 30, when using this
   
 
 
+
+
+<property>
+  <name>outlinks.host.ignore</name>
+  <value>false</value>
+  <description>
+    Ignore the outlinks that point out to the same host as the URL being indexed.
+    By default all outlinks are indexed.
+  </description>
+</property>
+
+<property>
+  <name>inlinks.host.ignore</name>
+  <value>false</value>
+  <description>
+    Ignore the inlinks coming from the same host as the URL being indexed. By default
+    all inlinks are indexed.
+  </description>
--- End diff --

Indeed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread jorgelbg
GitHub user jorgelbg opened a pull request:

https://github.com/apache/nutch/pull/78

Fix for NUTCH-2139 contributed by jorgelbg

Basic indexing capabilities for inlinks and outlinks. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jorgelbg/nutch NUTCH-2139

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/78.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #78


commit f1d16ac509146aada0817d58d40bbcbfd0bad44d
Author: Jorge Luis Betancourt 
Date:   2015-10-15T16:34:37Z

Fix for NUTCH-2139 contributed by jorgelbg






[jira] [Updated] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-15 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2139:
--
External issue ID: https://github.com/apache/nutch/pull/78






[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959186#comment-14959186
 ] 

ASF GitHub Bot commented on NUTCH-2139:
---

GitHub user jorgelbg opened a pull request:

https://github.com/apache/nutch/pull/78

Fix for NUTCH-2139 contributed by jorgelbg

Basic indexing capabilities for inlinks and outlinks. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jorgelbg/nutch NUTCH-2139

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/78.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #78


commit f1d16ac509146aada0817d58d40bbcbfd0bad44d
Author: Jorge Luis Betancourt 
Date:   2015-10-15T16:34:37Z

Fix for NUTCH-2139 contributed by jorgelbg









[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread jorgelbg
Github user jorgelbg commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42170226
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger 
value than 30, when using this
   
 
 
+
--- End diff --

Any thoughts on a good prefix to use? `index.links.outlinks.host.ignore` 
seems extremely long for my taste :)




[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959489#comment-14959489
 ] 

ASF GitHub Bot commented on NUTCH-2139:
---

Github user jorgelbg commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42170226
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger 
value than 30, when using this
   
 
 
+
--- End diff --

Any thoughts on a good prefix to use? `index.links.outlinks.host.ignore` 
seems extremely long for my taste :)







[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-15 Thread Michael Joyce (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959345#comment-14959345
 ] 

Michael Joyce commented on NUTCH-2141:
--

This was actually brought up in NUTCH-2108. There's also an 
[example handler|https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java]
that was added to illustrate that. The handler won't actually be run multiple 
times, so if you need to return concatenated content you need to do it in the 
handler and make sure it's returned appropriately.






[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959349#comment-14959349
 ] 

ASF GitHub Bot commented on NUTCH-2139:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42160948
  
--- Diff: 
src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java
 ---
@@ -0,0 +1,168 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.links;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlink;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.IndexingException;
+import org.apache.nutch.indexer.IndexingFilter;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.slf4j.LoggerFactory;
+
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Set;
+
+/**
+ * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
+ * outlinks and inlinks field(s) to the document.
+ *
+ * In case that you want to ignore the outlinks that point to the same host
+ * as the URL being indexed use the following settings in your configuration
+ * file:
+ *
+ * <property>
+ *   <name>outlinks.host.ignore</name>
+ *   <value>true</value>
+ * </property>
+ *
+ * The same configuration is available for inlinks:
+ *
+ * <property>
+ *   <name>inlinks.host.ignore</name>
+ *   <value>true</value>
+ * </property>
+ *
+ * To store only the host portion of each inlink URL or outlink URL add the
+ * following to your configuration file.
+ *
+ * <property>
+ *   <name>links.hosts.only</name>
+ *   <value>false</value>
+ * </property>
+ *
+ */
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
+Set hosts = new HashSet();
+
+for (Outlink outlink : outlinks) {
+  String linkUrl = outlink.getToUrl();
+  String outHost = new URL(linkUrl).getHost();
+
+  if (indexHost) {
+linkUrl = new URL(outlink.getToUrl()).getHost();
--- End diff --

Now linkUrl is equal to outHost, so this could be simplified.
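
A minimal sketch of the simplification, assuming the intent is to parse each outlink URL once and reuse the host. `OutlinkValueSketch` and `valuesToIndex` are hypothetical names, plain strings stand in for `Outlink` objects, and the checked exception is wrapped only to keep the sketch short.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class OutlinkValueSketch {

  /**
   * Returns the values that would be indexed for the given outlink URLs:
   * the de-duplicated hosts when indexHost is true, the full URLs otherwise.
   */
  public static List<String> valuesToIndex(List<String> toUrls, boolean indexHost) {
    Set<String> seenHosts = new HashSet<>();
    List<String> values = new ArrayList<>();
    for (String toUrl : toUrls) {
      String outHost;
      try {
        outHost = new URL(toUrl).getHost(); // parsed once, reused below
      } catch (MalformedURLException e) {
        throw new IllegalArgumentException("Malformed URL: " + toUrl, e);
      }
      if (indexHost) {
        if (!seenHosts.add(outHost)) {
          continue; // host already recorded
        }
        values.add(outHost);
      } else {
        values.add(toUrl);
      }
    }
    return values;
  }
}
```

`Set.add` already reports whether the host was new, so the separate `contains` check is not needed either.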







[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel
Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42160948
  
--- Diff: 
src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java
 ---
@@ -0,0 +1,168 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.links;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlink;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.IndexingException;
+import org.apache.nutch.indexer.IndexingFilter;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.slf4j.LoggerFactory;
+
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Set;
+
+/**
+ * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
+ * outlinks and inlinks field(s) to the document.
+ *
+ * In case that you want to ignore the outlinks that point to the same host
+ * as the URL being indexed use the following settings in your configuration
+ * file:
+ *
+ * <property>
+ *   <name>outlinks.host.ignore</name>
+ *   <value>true</value>
+ * </property>
+ *
+ * The same configuration is available for inlinks:
+ *
+ * <property>
+ *   <name>inlinks.host.ignore</name>
+ *   <value>true</value>
+ * </property>
+ *
+ * To store only the host portion of each inlink URL or outlink URL add the
+ * following to your configuration file.
+ *
+ * <property>
+ *   <name>links.hosts.only</name>
+ *   <value>false</value>
+ * </property>
+ *
+ */
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
+Set hosts = new HashSet();
+
+for (Outlink outlink : outlinks) {
+  String linkUrl = outlink.getToUrl();
+  String outHost = new URL(linkUrl).getHost();
+
+  if (indexHost) {
+linkUrl = new URL(outlink.getToUrl()).getHost();
--- End diff --

Now linkUrl is equal to outHost, so this could be simplified.




[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel
Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42160759
  
--- Diff: 
src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java
 ---
@@ -0,0 +1,168 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.links;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlink;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.IndexingException;
+import org.apache.nutch.indexer.IndexingFilter;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.slf4j.LoggerFactory;
+
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Set;
+
+/**
+ * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
+ * outlinks and inlinks field(s) to the document.
+ *
+ * In case that you want to ignore the outlinks that point to the same host
+ * as the URL being indexed use the following settings in your configuration
+ * file:
+ *
+ * <property>
+ *   <name>outlinks.host.ignore</name>
+ *   <value>true</value>
+ * </property>
+ *
+ * The same configuration is available for inlinks:
+ *
+ * <property>
+ *   <name>inlinks.host.ignore</name>
+ *   <value>true</value>
+ * </property>
+ *
+ * To store only the host portion of each inlink URL or outlink URL add the
+ * following to your configuration file.
+ *
+ * <property>
+ *   <name>links.hosts.only</name>
+ *   <value>false</value>
+ * </property>
+ *
+ */
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
+Set hosts = new HashSet();
+
+for (Outlink outlink : outlinks) {
+  String linkUrl = outlink.getToUrl();
+  String outHost = new URL(linkUrl).getHost();
+
+  if (indexHost) {
+linkUrl = new URL(outlink.getToUrl()).getHost();
+
+if (hosts.contains(linkUrl))
+  continue;
+
+hosts.add(linkUrl);
+  }
+
+  addFilteredLink("outlinks", url.toString(), linkUrl, outHost,
+  filterOutlinks, doc);
+}
+  }
+} catch (MalformedURLException e) {
+  LOG.error("Malformed URL in {}: {}", url, e.getMessage());
+}
+
+// Add the inlinks
+if (null != inlinks) {
+  Iterator iterator = inlinks.iterator();
+  Set inlinkHosts = new HashSet();
+
+  try {
+while (iterator.hasNext()) {
--- End diff --

Shouldn't the try-catch be inside the while loop? The first invalid URL 
would break the loop and cause the remaining inlinks to be skipped instead of 
indexed. Is this intended? OK, this should never happen since inlinks are 
validated URLs pointing to already fetched documents.
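
The difference can be sketched as follows, with the try-catch moved inside the loop so that one malformed URL is skipped and the remaining inlinks are still processed. `InlinkLoopSketch` and `hostsOf` are hypothetical names, and plain strings stand in for `Inlink` objects.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class InlinkLoopSketch {

  /** Collects the host of every well-formed URL and skips malformed ones. */
  public static List<String> hostsOf(Iterator<String> fromUrls) {
    List<String> hosts = new ArrayList<>();
    while (fromUrls.hasNext()) {
      String fromUrl = fromUrls.next();
      try {
        hosts.add(new URL(fromUrl).getHost());
      } catch (MalformedURLException e) {
        // Only this entry is dropped; the loop carries on with the next one.
        System.err.println("Skipping malformed inlink URL: " + fromUrl);
      }
    }
    return hosts;
  }
}
```

With the try-catch wrapped around the whole loop instead, the first exception would end the iteration and every later inlink would be lost.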



[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread jorgelbg
Github user jorgelbg commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42168120
  
--- Diff: 
src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java
 ---
@@ -0,0 +1,168 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.links;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlink;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.IndexingException;
+import org.apache.nutch.indexer.IndexingFilter;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.slf4j.LoggerFactory;
+
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Set;
+
+/**
+ * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
+ * outlinks and inlinks field(s) to the document.
+ *
+ * In case that you want to ignore the outlinks that point to the same host
+ * as the URL being indexed use the following settings in your configuration
+ * file:
+ *
+ * <property>
+ *   <name>outlinks.host.ignore</name>
+ *   <value>true</value>
+ * </property>
+ *
+ * The same configuration is available for inlinks:
+ *
+ * <property>
+ *   <name>inlinks.host.ignore</name>
+ *   <value>true</value>
+ * </property>
+ *
+ * To store only the host portion of each inlink URL or outlink URL add the
+ * following to your configuration file.
+ *
+ * <property>
+ *   <name>links.hosts.only</name>
+ *   <value>false</value>
+ * </property>
+ *
+ */
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
+Set hosts = new HashSet();
+
+for (Outlink outlink : outlinks) {
+  String linkUrl = outlink.getToUrl();
+  String outHost = new URL(linkUrl).getHost();
+
+  if (indexHost) {
+linkUrl = new URL(outlink.getToUrl()).getHost();
+
+if (hosts.contains(linkUrl))
+  continue;
+
+hosts.add(linkUrl);
+  }
+
+  addFilteredLink("outlinks", url.toString(), linkUrl, outHost,
+  filterOutlinks, doc);
+}
+  }
+} catch (MalformedURLException e) {
+  LOG.error("Malformed URL in {}: {}", url, e.getMessage());
+}
+
+// Add the inlinks
+if (null != inlinks) {
+  Iterator iterator = inlinks.iterator();
+  Set inlinkHosts = new HashSet();
+
+  try {
+while (iterator.hasNext()) {
--- End diff --

I thought the same: since the URL has already been fetched there shouldn't 
be any trouble, but it's an easy fix, so I can put it inside the while loop.
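
A minimal sketch of what "inside the while loop" could look like, assuming the
goal is that one malformed URL should not abort processing of the remaining
links. The class and method names (PerLinkTryCatch, hostsOf) are hypothetical
and are not part of the pull request; only java.net.URL and
MalformedURLException are taken from the quoted diff.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class PerLinkTryCatch {

  // Hypothetical helper: collects the host of every well-formed URL.
  // The try-catch sits inside the loop, so a malformed entry is skipped
  // and the remaining links are still processed.
  static List<String> hostsOf(List<String> links) {
    List<String> hosts = new ArrayList<>();
    for (String link : links) {
      try {
        hosts.add(new URL(link).getHost());
      } catch (MalformedURLException e) {
        System.err.println("Skipping malformed URL: " + link);
      }
    }
    return hosts;
  }

  public static void main(String[] args) {
    List<String> links = new ArrayList<>();
    links.add("http://example.org/a");
    links.add("not a url");
    links.add("http://example.com/b");
    System.out.println(hostsOf(links));
  }
}
```

With the try-catch wrapped around the whole loop instead, the first malformed
entry would end the loop and the links after it would be dropped silently.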


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel
Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42163943
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
   
 
 
+
--- End diff --

By best practice, the three properties should share a common prefix, so 
that it's clear which plugin they belong to.
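
A rough sketch of the idea, assuming a prefix such as "index.links." (the
prefix, the class name and the getBoolean helper are illustrative assumptions,
not names from the pull request). java.util.Properties stands in for the
Hadoop Configuration object so that the example is self-contained.

```java
import java.util.Properties;

public class PrefixedLinkProperties {

  // Hypothetical shared prefix; the property names in the pull request
  // ("outlinks.host.ignore", "inlinks.host.ignore", "links.hosts.only")
  // do not have one.
  static final String PREFIX = "index.links.";

  static final String OUTLINKS_HOST_IGNORE = PREFIX + "outlinks.host.ignore";
  static final String INLINKS_HOST_IGNORE = PREFIX + "inlinks.host.ignore";
  static final String HOSTS_ONLY = PREFIX + "hosts.only";

  // Reads a boolean property, falling back to the given default when unset.
  static boolean getBoolean(Properties conf, String key, boolean def) {
    return Boolean.parseBoolean(conf.getProperty(key, Boolean.toString(def)));
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(OUTLINKS_HOST_IGNORE, "true");

    System.out.println(OUTLINKS_HOST_IGNORE + " = "
        + getBoolean(conf, OUTLINKS_HOST_IGNORE, false));
    System.out.println(INLINKS_HOST_IGNORE + " = "
        + getBoolean(conf, INLINKS_HOST_IGNORE, false));
    System.out.println(HOSTS_ONLY + " = "
        + getBoolean(conf, HOSTS_ONLY, false));
  }
}
```

Building all three keys from one constant keeps them grouped when the
configuration file is sorted or searched by name.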




[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959396#comment-14959396
 ] 

ASF GitHub Bot commented on NUTCH-2139:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42163943
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
   
 
 
+
--- End diff --

By best practice, the three properties should share a common prefix, so 
that it's clear which plugin they belong to.


> Basic plugin to index inlinks and outlinks
> --
>
> Key: NUTCH-2139
> URL: https://issues.apache.org/jira/browse/NUTCH-2139
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: link, plugin
> Fix For: 1.11
>
>
> Basic plugin that allows to index the inlinks and outlinks of the web pages, 
> this could be very useful for analytic purposes, including neat 
> visualizations using d3.js for instance. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel
Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42164537
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
   
 
 
+
+
+
+  outlinks.host.ignore
+  false
+  
+Ignore the outlinks that point out to the same host as the URL being indexed. 
+By default all outlinks are indexed.
+  
+
+
+
+  inlinks.host.ignore
+  false
+  
+Ignore the inlinks coming from the same host as the URL being indexed. By default 
+all inlinks are indexed.
+  
--- End diff --

If db.ignore.internal.links == true (that's the default), this option 
wouldn't change anything, because host-internal inlinks aren't even stored in 
the linkdb. It could be worth adding a reference to db.ignore.internal.links.
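
To illustrate the interaction, here is a simplified model (an assumption-laden
sketch, not Nutch code): the linkdb step is modelled as dropping same-host
inlinks when db.ignore.internal.links is true, and the indexing filter as
dropping them again when inlinks.host.ignore is true. The class and method
names are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InternalInlinkModel {

  // Keeps every inlink unless dropSameHost is set and the inlink's host
  // equals the host of the page being indexed.
  static List<String> filter(String pageHost, List<String> inlinks,
      boolean dropSameHost) {
    List<String> kept = new ArrayList<>();
    for (String inlink : inlinks) {
      String host = URI.create(inlink).getHost();
      if (dropSameHost && pageHost.equalsIgnoreCase(host)) {
        continue;
      }
      kept.add(inlink);
    }
    return kept;
  }

  // Stage 1 models what the linkdb stores, stage 2 what the filter indexes.
  static List<String> indexed(String pageHost, List<String> inlinks,
      boolean dbIgnoreInternalLinks, boolean inlinksHostIgnore) {
    List<String> stored = filter(pageHost, inlinks, dbIgnoreInternalLinks);
    return filter(pageHost, stored, inlinksHostIgnore);
  }

  public static void main(String[] args) {
    List<String> inlinks = Arrays.asList(
        "http://example.org/about", "http://other.example.com/page");

    // With db.ignore.internal.links == true both calls give the same result.
    System.out.println(indexed("example.org", inlinks, true, true));
    System.out.println(indexed("example.org", inlinks, true, false));
  }
}
```

In this model inlinks.host.ignore only makes a difference once
db.ignore.internal.links is set to false.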




[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959407#comment-14959407
 ] 

ASF GitHub Bot commented on NUTCH-2139:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42164537
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
   
 
 
+
+
+
+  outlinks.host.ignore
+  false
+  
+Ignore the outlinks that point out to the same host as the URL being indexed. 
+By default all outlinks are indexed.
+  
+
+
+
+  inlinks.host.ignore
+  false
+  
+Ignore the inlinks coming from the same host as the URL being indexed. By default 
+all inlinks are indexed.
+  
--- End diff --

If db.ignore.internal.links == true (that's the default), this option 
wouldn't change anything, because host-internal inlinks aren't even stored in 
the linkdb. It could be worth adding a reference to db.ignore.internal.links.




[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959460#comment-14959460
 ] 

ASF GitHub Bot commented on NUTCH-2139:
---

Github user jorgelbg commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42168120
  
--- Diff: 
src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java
 ---
@@ -0,0 +1,168 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.links;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlink;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.IndexingException;
+import org.apache.nutch.indexer.IndexingFilter;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.slf4j.LoggerFactory;
+
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Set;
+
+/**
+ * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
+ * outlinks and inlinks field(s) to the document.
+ *
+ * In case that you want to ignore the outlinks that point to the same host
+ * as the URL being indexed use the following settings in your configuration
+ * file:
+ *
+ * 
+ *   outlinks.host.ignore
+ *   true
+ * 
+ *
+ * The same configuration is available for inlinks:
+ *
+ * 
+ *   inlinks.host.ignore
+ *   true
+ * 
+ *
+ * To store only the host portion of each inlink URL or outlink URL add the
+ * following to your configuration file.
+ *
+ * 
+ *   links.hosts.only
+ *   false
+ * 
+ *
+ */
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
+Set hosts = new HashSet();
+
+for (Outlink outlink : outlinks) {
+  String linkUrl = outlink.getToUrl();
+  String outHost = new URL(linkUrl).getHost();
+
+  if (indexHost) {
+linkUrl = new URL(outlink.getToUrl()).getHost();
+
+if (hosts.contains(linkUrl))
+  continue;
+
+hosts.add(linkUrl);
+  }
+
+  addFilteredLink("outlinks", url.toString(), linkUrl, outHost,
+  filterOutlinks, doc);
+}
+  }
+} catch (MalformedURLException e) {
+  LOG.error("Malformed URL in {}: {}", url, e.getMessage());
+}
+
+// Add the inlinks
+if (null != inlinks) {
+  Iterator iterator = inlinks.iterator();
+  Set inlinkHosts = new HashSet();
+
+  try {
+while (iterator.hasNext()) {
--- End diff --

I thought the same: since the URL has already been fetched there shouldn't 
be any trouble, but it's an easy fix, so I can put it inside the while loop.



[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959352#comment-14959352
 ] 

ASF GitHub Bot commented on NUTCH-2139:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42161108
  
--- Diff: 
src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java
 ---
@@ -0,0 +1,168 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.links;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlink;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.IndexingException;
+import org.apache.nutch.indexer.IndexingFilter;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.slf4j.LoggerFactory;
+
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Set;
+
+/**
+ * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
+ * outlinks and inlinks field(s) to the document.
+ *
+ * In case that you want to ignore the outlinks that point to the same host
+ * as the URL being indexed use the following settings in your configuration
+ * file:
+ *
+ * 
+ *   outlinks.host.ignore
+ *   true
+ * 
+ *
+ * The same configuration is available for inlinks:
+ *
+ * 
+ *   inlinks.host.ignore
+ *   true
+ * 
+ *
+ * To store only the host portion of each inlink URL or outlink URL add the
+ * following to your configuration file.
+ *
+ * 
+ *   links.hosts.only
+ *   false
+ * 
+ *
+ */
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
--- End diff --

See the comment below regarding the nesting of try-catch and loops: outlinks 
are not necessarily validated inside the parse data.




[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel
Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42161108
  
--- Diff: 
src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java
 ---
@@ -0,0 +1,168 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.links;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlink;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.IndexingException;
+import org.apache.nutch.indexer.IndexingFilter;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.slf4j.LoggerFactory;
+
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Set;
+
+/**
+ * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
+ * outlinks and inlinks field(s) to the document.
+ *
+ * In case that you want to ignore the outlinks that point to the same host
+ * as the URL being indexed use the following settings in your configuration
+ * file:
+ *
+ * 
+ *   outlinks.host.ignore
+ *   true
+ * 
+ *
+ * The same configuration is available for inlinks:
+ *
+ * 
+ *   inlinks.host.ignore
+ *   true
+ * 
+ *
+ * To store only the host portion of each inlink URL or outlink URL add the
+ * following to your configuration file.
+ *
+ * 
+ *   links.hosts.only
+ *   false
+ * 
+ *
+ */
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
--- End diff --

See the comment below regarding the nesting of try-catch and loops: outlinks 
are not necessarily validated inside the parse data.

