[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-22 Thread asfgit

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/78


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-21 Thread sebastian-nagel

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42675133
  
--- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java ---
@@ -0,0 +1,168 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.links;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlink;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.IndexingException;
+import org.apache.nutch.indexer.IndexingFilter;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.slf4j.LoggerFactory;
+
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Set;
+
+/**
+ * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
+ * outlinks and inlinks field(s) to the document.
+ *
+ * In case that you want to ignore the outlinks that point to the same host
+ * as the URL being indexed use the following settings in your configuration
+ * file:
+ *
+ * <property>
+ *   <name>outlinks.host.ignore</name>
+ *   <value>true</value>
+ * </property>
+ *
+ * The same configuration is available for inlinks:
+ *
+ * <property>
+ *   <name>inlinks.host.ignore</name>
+ *   <value>true</value>
+ * </property>
+ *
+ * To store only the host portion of each inlink URL or outlink URL add the
+ * following to your configuration file.
+ *
+ * <property>
+ *   <name>links.hosts.only</name>
+ *   <value>false</value>
+ * </property>
+ *
+ */
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
+Set hosts = new HashSet();
+
+for (Outlink outlink : outlinks) {
+  String linkUrl = outlink.getToUrl();
+  String outHost = new URL(linkUrl).getHost();
+
+  if (indexHost) {
+linkUrl = new URL(outlink.getToUrl()).getHost();
+
+if (hosts.contains(linkUrl))
+  continue;
+
+hosts.add(linkUrl);
+  }
+
+  addFilteredLink("outlinks", url.toString(), linkUrl, outHost,
+  filterOutlinks, doc);
+}
+  }
+} catch (MalformedURLException e) {
+  LOG.error("Malformed URL in {}: {}", url, e.getMessage());
+}
+
+// Add the inlinks
+if (null != inlinks) {
+  Iterator iterator = inlinks.iterator();
+  Set inlinkHosts = new HashSet();
+
+  try {
+while (iterator.hasNext()) {
--- End diff --

+1 great!

On 10/15/2015 09:20 PM, Jorge Luis Betancourt wrote:
> In 
src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java
> :
> 
>> +
>> +  addFilteredLink("outlinks", url.toString(), linkUrl, outHost,
>> +  filterOutlinks, doc);
>> +}
>> +  }
>> +} catch (MalformedURLException e) {
>> +  LOG.error(

[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread jorgelbg

Github user jorgelbg commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42176268
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
   
 
 
+
+
+<property>
+  <name>outlinks.host.ignore</name>
+  <value>false</value>
+  <description>
+Ignore the outlinks that point out to the same host as the URL being indexed.
+By default all outlinks are indexed.
+  </description>
+</property>
+
+<property>
+  <name>inlinks.host.ignore</name>
+  <value>false</value>
+  <description>
+Ignore the inlinks coming from the same host as the URL being indexed. By default
+all inlinks are indexed.
+  </description>
--- End diff --

Indeed.



[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread jorgelbg

Github user jorgelbg commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42170226
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
   
 
 
+
--- End diff --

Any thoughts on a good prefix to use? `index.links.outlinks.host.ignore` 
seems extremely long for my taste :)



[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread jorgelbg

Github user jorgelbg commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42168120
  
--- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java ---
@@ -0,0 +1,168 @@
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
+Set hosts = new HashSet();
+
+for (Outlink outlink : outlinks) {
+  String linkUrl = outlink.getToUrl();
+  String outHost = new URL(linkUrl).getHost();
+
+  if (indexHost) {
+linkUrl = new URL(outlink.getToUrl()).getHost();
+
+if (hosts.contains(linkUrl))
+  continue;
+
+hosts.add(linkUrl);
+  }
+
+  addFilteredLink("outlinks", url.toString(), linkUrl, outHost,
+  filterOutlinks, doc);
+}
+  }
+} catch (MalformedURLException e) {
+  LOG.error("Malformed URL in {}: {}", url, e.getMessage());
+}
+
+// Add the inlinks
+if (null != inlinks) {
+  Iterator iterator = inlinks.iterator();
+  Set inlinkHosts = new HashSet();
+
+  try {
+while (iterator.hasNext()) {
--- End diff --

I thought the same: since the URL has already been fetched, there shouldn't 
be any trouble. But it's an easy fix, so I can put the try-catch inside the 
while loop.



[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42164537
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
   
 
 
+
+
+<property>
+  <name>outlinks.host.ignore</name>
+  <value>false</value>
+  <description>
+Ignore the outlinks that point out to the same host as the URL being indexed.
+By default all outlinks are indexed.
+  </description>
+</property>
+
+<property>
+  <name>inlinks.host.ignore</name>
+  <value>false</value>
+  <description>
+Ignore the inlinks coming from the same host as the URL being indexed. By default
+all inlinks are indexed.
+  </description>
--- End diff --

If db.ignore.internal.links == true (that's the default), this option 
wouldn't change anything, because host-internal inlinks aren't even stored in 
the linkdb. It could be worth adding a reference to db.ignore.internal.links.
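
The interplay described above can be sketched in plain Java. This is only a toy model under stated assumptions: the class and method names are hypothetical, the linkdb is reduced to a list of source URLs, and "same host" is a simple case-insensitive host comparison. It is not Nutch's actual code.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical toy model, not part of Nutch.
class InternalInlinksInterplay {

  /** Drops links whose host equals the target's host, but only when enabled. */
  static List<String> dropSameHost(String target, List<String> links, boolean enabled) {
    String targetHost = URI.create(target).getHost();
    List<String> kept = new ArrayList<>();
    for (String link : links) {
      boolean sameHost = targetHost.equalsIgnoreCase(URI.create(link).getHost());
      if (enabled && sameHost) {
        continue;
      }
      kept.add(link);
    }
    return kept;
  }

  /**
   * Stage 1 models db.ignore.internal.links (what reaches the linkdb),
   * stage 2 models inlinks.host.ignore (what the indexing filter keeps).
   */
  static List<String> indexedInlinks(String target, List<String> sources,
      boolean dbIgnoreInternalLinks, boolean inlinksHostIgnore) {
    List<String> stored = dropSameHost(target, sources, dbIgnoreInternalLinks);
    return dropSameHost(target, stored, inlinksHostIgnore);
  }

  public static void main(String[] args) {
    String target = "http://example.org/page";
    List<String> sources =
        Arrays.asList("http://example.org/index", "http://example.com/blog");
    // With internal links already dropped in stage 1, stage 2 has nothing to remove.
    System.out.println(indexedInlinks(target, sources, true, false));
    System.out.println(indexedInlinks(target, sources, true, true));
    // Only when internal links are stored does the indexing option matter.
    System.out.println(indexedInlinks(target, sources, false, false));
    System.out.println(indexedInlinks(target, sources, false, true));
  }
}
```

In this model the second flag only has a visible effect when the first one is false, which is the reason for pointing readers of the property description to db.ignore.internal.links.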



[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42163943
  
--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
   
 
 
+
--- End diff --

By best practice, the 3 properties should share a common prefix, so that 
it's clear which plugin they belong to.
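
One way to apply this is to build every key from a single prefix constant. The sketch below is hypothetical: the class name is made up, and the `index.links.` prefix is only the candidate floated elsewhere in this thread, not a confirmed final name.

```java
// Hypothetical constants holder; the prefix is an assumption taken from the
// candidate name discussed in this thread.
class LinksFilterKeys {

  static final String PREFIX = "index.links.";

  static final String OUTLINKS_HOST_IGNORE = PREFIX + "outlinks.host.ignore";
  static final String INLINKS_HOST_IGNORE = PREFIX + "inlinks.host.ignore";
  static final String HOSTS_ONLY = PREFIX + "hosts.only";

  public static void main(String[] args) {
    // Every key now carries the plugin's prefix.
    System.out.println(OUTLINKS_HOST_IGNORE);
    System.out.println(INLINKS_HOST_IGNORE);
    System.out.println(HOSTS_ONLY);
  }
}
```

Deriving the keys from one constant keeps the three names consistent if the prefix is changed later.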



[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42161108
  
--- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java ---
@@ -0,0 +1,168 @@
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
--- End diff --

See the comment below regarding the nesting of try-catch and loops: outlinks 
are not necessarily validated inside the parse data.



[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42160948
  
--- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java ---
@@ -0,0 +1,168 @@
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
+Set hosts = new HashSet();
+
+for (Outlink outlink : outlinks) {
+  String linkUrl = outlink.getToUrl();
+  String outHost = new URL(linkUrl).getHost();
+
+  if (indexHost) {
+linkUrl = new URL(outlink.getToUrl()).getHost();
--- End diff --

Now linkUrl is equal to outHost, so this could be simplified.
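
A minimal sketch of that simplification, assuming a stand-alone helper rather than the plugin's real `filter` method: the URL is parsed once per outlink and the host is reused, and `Set.add` replaces the separate `contains`/`add` pair. The class and method names are hypothetical, and the same-host filtering done by `addFilteredLink` is left out.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-alone sketch; not the plugin's actual code.
class OutlinkHostDedup {

  /** Returns full URLs, or each distinct host once when hostsOnly is set. */
  static List<String> linksToIndex(List<String> outlinks, boolean hostsOnly) {
    List<String> result = new ArrayList<>();
    Set<String> seenHosts = new HashSet<>();
    for (String outlink : outlinks) {
      String host;
      try {
        // Parse once; the host serves both as the value to index and as
        // the deduplication key.
        host = new URL(outlink).getHost();
      } catch (MalformedURLException e) {
        continue; // skip only this outlink
      }
      if (!hostsOnly) {
        result.add(outlink);
      } else if (seenHosts.add(host)) {
        // Set.add returns false if the host was already present.
        result.add(host);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> outlinks = Arrays.asList(
        "http://example.org/a", "http://example.org/b", "http://example.com/c");
    System.out.println(linksToIndex(outlinks, true));
    System.out.println(linksToIndex(outlinks, false));
  }
}
```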



[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread sebastian-nagel

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42160759
  
--- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java ---
@@ -0,0 +1,168 @@
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+  .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+  CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+// Add the outlinks
+Outlink[] outlinks = parse.getData().getOutlinks();
+
+try {
+  if (outlinks != null) {
+Set hosts = new HashSet();
+
+for (Outlink outlink : outlinks) {
+  String linkUrl = outlink.getToUrl();
+  String outHost = new URL(linkUrl).getHost();
+
+  if (indexHost) {
+linkUrl = new URL(outlink.getToUrl()).getHost();
+
+if (hosts.contains(linkUrl))
+  continue;
+
+hosts.add(linkUrl);
+  }
+
+  addFilteredLink("outlinks", url.toString(), linkUrl, outHost,
+  filterOutlinks, doc);
+}
+  }
+} catch (MalformedURLException e) {
+  LOG.error("Malformed URL in {}: {}", url, e.getMessage());
+}
+
+// Add the inlinks
+if (null != inlinks) {
+  Iterator iterator = inlinks.iterator();
+  Set inlinkHosts = new HashSet();
+
+  try {
+while (iterator.hasNext()) {
--- End diff --

Shouldn't the try-catch be inside the while loop? The first invalid URL 
would break the loop and cause the remaining inlinks to be skipped during 
indexing. Is this intended? OK, this should never happen, since inlinks are 
validated URLs pointing to already fetched documents.
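
To illustrate the nesting in question, here is a small stand-alone sketch with the try-catch placed inside the loop. The class and method names are hypothetical and the links are plain strings rather than Nutch `Inlink` objects.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not the plugin's actual code.
class PerLinkErrorHandling {

  /** Returns the host of every parsable link, skipping malformed ones. */
  static List<String> hostsOf(List<String> links) {
    List<String> hosts = new ArrayList<>();
    for (String link : links) {
      try {
        // The try-catch sits inside the loop, so one bad URL only skips
        // that link instead of ending the whole iteration.
        hosts.add(new URL(link).getHost());
      } catch (MalformedURLException e) {
        System.err.println("Skipping malformed URL " + link + ": " + e.getMessage());
      }
    }
    return hosts;
  }

  public static void main(String[] args) {
    List<String> links = Arrays.asList(
        "http://example.org/a", "not a url", "http://example.com/b");
    System.out.println(hostsOf(links));
  }
}
```

With the try-catch wrapped around the loop instead, the link after the malformed entry would never be processed.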



[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg

2015-10-15 Thread jorgelbg

GitHub user jorgelbg opened a pull request:

https://github.com/apache/nutch/pull/78

Fix for NUTCH-2139 contributed by jorgelbg

Basic indexing capabilities for inlinks and outlinks. 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jorgelbg/nutch NUTCH-2139

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/nutch/pull/78.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #78


commit f1d16ac509146aada0817d58d40bbcbfd0bad44d
Author: Jorge Luis Betancourt 
Date:   2015-10-15T16:34:37Z

Fix for NUTCH-2139 contributed by jorgelbg




