[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content
[ https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959627#comment-14959627 ]

Balaji Gurumurthy commented on NUTCH-2141:
--

When we concatenate the content from multiple pages and then try to load it back into the browser using JavascriptExecutor, more often than not we get exceptions ("Unterminated string literal" and "Missing ; before statement", to name a few) while executing the JavaScript string. Debugging these errors across the concatenated content of all the pages is a pain. Instead of concatenating the content, loading it back into the driver, and then reading it from the driver again in the HTTPResponse class, simply returning the concatenated result to Nutch seemed better.

> Change the InteractiveSelenium plugin handler Interface to return page content
> --
>
> Key: NUTCH-2141
> URL: https://issues.apache.org/jira/browse/NUTCH-2141
> Project: Nutch
> Issue Type: Improvement
> Components: plugin
> Reporter: Balaji Gurumurthy
> Labels: selenium
>
> The handler interface in the protocol-interactiveselenium plugin currently
> provides methods to manipulate the page content, and the HTTPResponse class
> reads the page content from the driver. This limits the amount of HTML
> content that can be returned to Nutch.
> The processDriver method could return a String object instead. This is
> particularly helpful in cases such as handling pagination, when multiple
> pages' content can be appended and returned from the handler.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
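The proposal in the issue description can be illustrated with a small sketch. Everything named here is hypothetical: PageDriver stands in for the Selenium driver (it is not the WebDriver API), and ContentReturningHandler is only one possible shape for a processDriver method that returns a String. The point is just that the handler can append the content of each page and hand the result back directly, without pushing it through JavascriptExecutor.

```java
import java.util.Arrays;
import java.util.List;

final class PaginationDemo {
  /** In-memory fake driver over a fixed list of pages, for illustration only. */
  static PageDriver fakeDriver(final List<String> pages) {
    return new PageDriver() {
      private int index = 0;

      @Override
      public String getPageSource() {
        return pages.get(index);
      }

      @Override
      public boolean goToNextPage() {
        if (index + 1 >= pages.size()) {
          return false;
        }
        index++;
        return true;
      }
    };
  }

  public static void main(String[] args) {
    PageDriver driver = fakeDriver(Arrays.asList("<p>page 1</p>", "<p>page 2</p>"));
    System.out.println(new PaginationHandler().processDriver(driver));
  }
}

/** Hypothetical stand-in for the Selenium driver; not the real WebDriver API. */
interface PageDriver {
  /** Returns the HTML of the page currently loaded. */
  String getPageSource();

  /** Advances to the next page; returns false when there is none. */
  boolean goToNextPage();
}

/** Hypothetical handler contract in which processDriver returns the content. */
interface ContentReturningHandler {
  String processDriver(PageDriver driver);
}

/** Walks all pages and returns their concatenated HTML as a plain String. */
final class PaginationHandler implements ContentReturningHandler {
  @Override
  public String processDriver(PageDriver driver) {
    StringBuilder content = new StringBuilder();
    do {
      content.append(driver.getPageSource());
    } while (driver.goToNextPage());
    return content.toString();
  }
}
```

Because the result never passes through a JavaScript string literal, quoting problems in the page content cannot surface as script errors in this arrangement.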
[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content
[ https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959659#comment-14959659 ]

Michael Joyce commented on NUTCH-2141:
--

Cool, makes sense. Do you have any examples? I'd like to poke at it as well. You're going to need to handle the screenshot functionality differently too, since getHTMLContent does more than just return the body content. We probably don't need the DefalultMultiInteractionHandler example either if this basically replaces it. [~asitang] might have some ideas as well.
[jira] [Created] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error
Karanjeet Singh created NUTCH-2142:
--

Summary: Nutch File Dump - FileNotFoundException (Invalid Argument) Error
Key: NUTCH-2142
URL: https://issues.apache.org/jira/browse/NUTCH-2142
Project: Nutch
Issue Type: Bug
Components: tool, util
Affects Versions: 1.10, 1.11
Environment: Operating System - Linux (RHEL 6.2)
Reporter: Karanjeet Singh
Fix For: 1.11

Got *FileNotFoundException* while running nutch dump.

*Cause*: Character '?' in file name/extension producing the below error.

*Error Details*
java.io.FileNotFoundException: /media/PATRO/Karan/nutch_12Oct/other_gun_urls/img/99/fb/97d3980f9954b597f372d092b97eff22_27tlt_recon_1_black_g_10_handle_.jpeg? (Invalid argument)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
    at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:222)
    at org.apache.nutch.tools.FileDumper.main(FileDumper.java:325)
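One way to avoid this class of error is to sanitize the generated file name before opening the output stream. The following is only a sketch under assumptions: DumpFileNames and sanitize are hypothetical names, not part of FileDumper, and the set of replaced characters is a guess at what common file systems reject, not something the report specifies beyond '?'.

```java
final class DumpFileNames {
  /**
   * Replaces characters that many file systems reject in file names
   * (assumed set: \ / : * ? " < > |) with an underscore.
   */
  static String sanitize(String fileName) {
    return fileName.replaceAll("[\\\\/:*?\"<>|]", "_");
  }

  public static void main(String[] args) {
    // File name taken from the stack trace in the report (directory part omitted).
    String reported =
        "97d3980f9954b597f372d092b97eff22_27tlt_recon_1_black_g_10_handle_.jpeg?";
    System.out.println(sanitize(reported));
  }
}
```

Note that sanitize is meant for the final path component only; applied to a full path it would also replace the directory separators.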
[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks
[ https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959565#comment-14959565 ]

ASF GitHub Bot commented on NUTCH-2139:
--

Github user jorgelbg commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/78#discussion_r42176268

    --- Diff: conf/nutch-default.xml ---
    @@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
    +
    +
    +<property>
    +  <name>outlinks.host.ignore</name>
    +  <value>false</value>
    +  <description>
    +Ignore the outlinks that point out to the same host as the URL being indexed.
    +By default all outlinks are indexed.
    +  </description>
    +</property>
    +
    +<property>
    +  <name>inlinks.host.ignore</name>
    +  <value>false</value>
    +  <description>
    +Ignore the inlinks coming from the same host as the URL being indexed. By default
    +all inlinks are indexed.
    +  </description>
    --- End diff --

    Indeed.

> Basic plugin to index inlinks and outlinks
> --
>
> Key: NUTCH-2139
> URL: https://issues.apache.org/jira/browse/NUTCH-2139
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, plugin
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Minor
> Labels: link, plugin
> Fix For: 1.11
>
> Basic plugin that allows to index the inlinks and outlinks of the web pages,
> this could be very useful for analytic purposes, including neat
> visualizations using d3.js for instance.
[jira] [Created] (NUTCH-2143) GeneratorJob ignores batch id passed as argument
Sebastian Nagel created NUTCH-2143:
--

Summary: GeneratorJob ignores batch id passed as argument
Key: NUTCH-2143
URL: https://issues.apache.org/jira/browse/NUTCH-2143
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 2.3.1
Reporter: Sebastian Nagel
Priority: Blocker
Fix For: 2.3.1

The batch id passed to GeneratorJob by option/argument -batchId is ignored and a generated batch id is used to mark the current batch. Log snippets from a run of bin/crawl:
{noformat}
bin/nutch generate ... -batchId 1444941073-14208
...
GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
Fetching :
bin/nutch fetch ... 1444941073-14208 ...
...
QueueFeeder finished: total 0 records. Hit by time limit :0
{noformat}

The generated URLs are marked with the wrong batch id:
{noformat}
hbase(main):010:0> scan 'test_webpage'
ROW                      COLUMN+CELL
 org.apache.nutch:http/  column=f:bid, timestamp=1444941077080, value=1444941074-858443668
...
 org.apache.nutch:http/  column=mk:_gnmrk_, timestamp=1444941077080, value=1444941074-858443668
{noformat}
and fetcher will not fetch anything.

This problem was reported by Sherban Drulea [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html],[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].
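The expected behaviour can be sketched as follows: use the batch id given on the command line when there is one, and generate a fresh id only as a fallback. This is not the GeneratorJob code. BatchIds, resolve and generate are hypothetical names, and the "epoch seconds, dash, random number" format is merely modelled on the ids visible in the log snippet above.

```java
import java.util.Random;

final class BatchIds {
  /** Generates an id shaped like the ones in the log: epoch seconds, dash, random number. */
  static String generate(long nowMillis, Random random) {
    return (nowMillis / 1000L) + "-" + random.nextInt(Integer.MAX_VALUE);
  }

  /** Prefers the id requested via -batchId; generates one only if none was given. */
  static String resolve(String requested, long nowMillis, Random random) {
    if (requested != null && !requested.isEmpty()) {
      return requested;
    }
    return generate(nowMillis, random);
  }

  public static void main(String[] args) {
    Random random = new Random();
    long now = System.currentTimeMillis();
    System.out.println(resolve("1444941073-14208", now, random)); // requested id is kept
    System.out.println(resolve(null, now, random));               // fallback id is generated
  }
}
```

With this rule the id used to mark the generated rows is the same one the fetch step is later asked for, so the fetcher would find its batch.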
[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg
GitHub user jorgelbg opened a pull request:

    https://github.com/apache/nutch/pull/78

    Fix for NUTCH-2139 contributed by jorgelbg

    Basic indexing capabilities for inlinks and outlinks.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jorgelbg/nutch NUTCH-2139

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/78.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #78

commit f1d16ac509146aada0817d58d40bbcbfd0bad44d
Author: Jorge Luis Betancourt
Date:   2015-10-15T16:34:37Z

    Fix for NUTCH-2139 contributed by jorgelbg

---
If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.
---
[jira] [Updated] (NUTCH-2139) Basic plugin to index inlinks and outlinks
[ https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2139:
--
    External issue ID: https://github.com/apache/nutch/pull/78
[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg
Github user jorgelbg commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/78#discussion_r42170226

    --- Diff: conf/nutch-default.xml ---
    @@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
    +
    --- End diff --

    Any thoughts on a good prefix to use? `index.links.outlinks.host.ignore` seems extremely long for my taste :)
[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content
[ https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959345#comment-14959345 ]

Michael Joyce commented on NUTCH-2141:
--

This was actually brought up in NUTCH-2108. There's also an [example handler | https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java] that was added to illustrate it. The handler won't actually be run multiple times, so if you need to return concatenated content you need to build it in the handler and make sure it's returned appropriately.
[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks
[ https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959349#comment-14959349 ]

ASF GitHub Bot commented on NUTCH-2139:
--

Github user sebastian-nagel commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/78#discussion_r42160948

    --- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java ---
    @@ -0,0 +1,168 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nutch.indexer.links;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.io.Text;
    +import org.apache.nutch.crawl.CrawlDatum;
    +import org.apache.nutch.crawl.Inlink;
    +import org.apache.nutch.crawl.Inlinks;
    +import org.apache.nutch.indexer.IndexingException;
    +import org.apache.nutch.indexer.IndexingFilter;
    +import org.apache.nutch.indexer.NutchDocument;
    +import org.apache.nutch.parse.Outlink;
    +import org.apache.nutch.parse.Parse;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.MalformedURLException;
    +import java.net.URL;
    +import java.util.HashSet;
    +import java.util.Iterator;
    +import java.util.Set;
    +
    +/**
    + * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
    + * <code>outlinks</code> and <code>inlinks</code> field(s) to the document.
    + *
    + * In case that you want to ignore the outlinks that point to the same host
    + * as the URL being indexed use the following settings in your configuration
    + * file:
    + *
    + * <property>
    + *   <name>outlinks.host.ignore</name>
    + *   <value>true</value>
    + * </property>
    + *
    + * The same configuration is available for inlinks:
    + *
    + * <property>
    + *   <name>inlinks.host.ignore</name>
    + *   <value>true</value>
    + * </property>
    + *
    + * To store only the host portion of each inlink URL or outlink URL add the
    + * following to your configuration file.
    + *
    + * <property>
    + *   <name>links.hosts.only</name>
    + *   <value>false</value>
    + * </property>
    + *
    + */
    +public class LinksIndexingFilter implements IndexingFilter {
    +
    +  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
    +  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
    +  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
    +
    +  public final static org.slf4j.Logger LOG = LoggerFactory
    +      .getLogger(LinksIndexingFilter.class);
    +
    +  private Configuration conf;
    +  private boolean filterOutlinks;
    +  private boolean filterInlinks;
    +  private boolean indexHost;
    +
    +  @Override
    +  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    +      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    +
    +    // Add the outlinks
    +    Outlink[] outlinks = parse.getData().getOutlinks();
    +
    +    try {
    +      if (outlinks != null) {
    +        Set hosts = new HashSet();
    +
    +        for (Outlink outlink : outlinks) {
    +          String linkUrl = outlink.getToUrl();
    +          String outHost = new URL(linkUrl).getHost();
    +
    +          if (indexHost) {
    +            linkUrl = new URL(outlink.getToUrl()).getHost();
    --- End diff --

    linkUrl is now equal to outHost, so this could be simplified.
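The simplification suggested in the comment above can be sketched in isolation. This is not the plugin code: OutlinkValue and valueToIndex are hypothetical names, and the checked MalformedURLException is rethrown as an unchecked exception only to keep the example short. The idea is that the host is parsed once and reused, so the second `new URL(...)` call disappears.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class OutlinkValue {
  /**
   * Returns the value to index for an outlink: the host when indexHost is
   * set, otherwise the full target URL. The URL is parsed exactly once.
   */
  static String valueToIndex(String toUrl, boolean indexHost) {
    String outHost;
    try {
      outHost = new URL(toUrl).getHost();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Malformed URL: " + toUrl, e);
    }
    return indexHost ? outHost : toUrl;
  }

  public static void main(String[] args) {
    System.out.println(valueToIndex("http://example.org/a/b.html", true));
    System.out.println(valueToIndex("http://example.org/a/b.html", false));
  }
}
```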
[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg
Github user sebastian-nagel commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/78#discussion_r42160759

    --- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java ---
    @@ -0,0 +1,168 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nutch.indexer.links;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.io.Text;
    +import org.apache.nutch.crawl.CrawlDatum;
    +import org.apache.nutch.crawl.Inlink;
    +import org.apache.nutch.crawl.Inlinks;
    +import org.apache.nutch.indexer.IndexingException;
    +import org.apache.nutch.indexer.IndexingFilter;
    +import org.apache.nutch.indexer.NutchDocument;
    +import org.apache.nutch.parse.Outlink;
    +import org.apache.nutch.parse.Parse;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.MalformedURLException;
    +import java.net.URL;
    +import java.util.HashSet;
    +import java.util.Iterator;
    +import java.util.Set;
    +
    +/**
    + * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
    + * <code>outlinks</code> and <code>inlinks</code> field(s) to the document.
    + *
    + * In case that you want to ignore the outlinks that point to the same host
    + * as the URL being indexed use the following settings in your configuration
    + * file:
    + *
    + * <property>
    + *   <name>outlinks.host.ignore</name>
    + *   <value>true</value>
    + * </property>
    + *
    + * The same configuration is available for inlinks:
    + *
    + * <property>
    + *   <name>inlinks.host.ignore</name>
    + *   <value>true</value>
    + * </property>
    + *
    + * To store only the host portion of each inlink URL or outlink URL add the
    + * following to your configuration file.
    + *
    + * <property>
    + *   <name>links.hosts.only</name>
    + *   <value>false</value>
    + * </property>
    + *
    + */
    +public class LinksIndexingFilter implements IndexingFilter {
    +
    +  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
    +  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
    +  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
    +
    +  public final static org.slf4j.Logger LOG = LoggerFactory
    +      .getLogger(LinksIndexingFilter.class);
    +
    +  private Configuration conf;
    +  private boolean filterOutlinks;
    +  private boolean filterInlinks;
    +  private boolean indexHost;
    +
    +  @Override
    +  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    +      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    +
    +    // Add the outlinks
    +    Outlink[] outlinks = parse.getData().getOutlinks();
    +
    +    try {
    +      if (outlinks != null) {
    +        Set hosts = new HashSet();
    +
    +        for (Outlink outlink : outlinks) {
    +          String linkUrl = outlink.getToUrl();
    +          String outHost = new URL(linkUrl).getHost();
    +
    +          if (indexHost) {
    +            linkUrl = new URL(outlink.getToUrl()).getHost();
    +
    +            if (hosts.contains(linkUrl))
    +              continue;
    +
    +            hosts.add(linkUrl);
    +          }
    +
    +          addFilteredLink("outlinks", url.toString(), linkUrl, outHost,
    +              filterOutlinks, doc);
    +        }
    +      }
    +    } catch (MalformedURLException e) {
    +      LOG.error("Malformed URL in {}: {}", url, e.getMessage());
    +    }
    +
    +    // Add the inlinks
    +    if (null != inlinks) {
    +      Iterator iterator = inlinks.iterator();
    +      Set inlinkHosts = new HashSet();
    +
    +      try {
    +        while (iterator.hasNext()) {
    --- End diff --

    Shouldn't the try-catch be inside the while loop? The first invalid URL would break the loop and cause the remaining inlinks to be skipped during indexing. Is this intended? OK, this should never happen, since inlinks are validated URLs pointing to already fetched documents.
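The difference raised in this review comment can be shown with a small standalone sketch. PerLinkErrorHandling and hostsOf are hypothetical names and plain strings stand in for the Inlink objects; the only point is where the try-catch sits. With the catch inside the loop, a malformed URL is skipped and the remaining entries are still processed.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PerLinkErrorHandling {
  /** Extracts the host of every well-formed URL, skipping malformed ones. */
  static List<String> hostsOf(List<String> urls) {
    List<String> hosts = new ArrayList<String>();
    for (String candidate : urls) {
      try {
        hosts.add(new URL(candidate).getHost());
      } catch (MalformedURLException e) {
        // Only this entry is dropped; the loop continues with the next one.
        System.err.println("Skipping malformed URL: " + candidate);
      }
    }
    return hosts;
  }

  public static void main(String[] args) {
    List<String> urls =
        Arrays.asList("http://a.example/x", "not a url", "http://b.example/y");
    System.out.println(hostsOf(urls));
  }
}
```

Had the try-catch wrapped the whole loop instead, the second entry would have ended the iteration and the host of the third URL would never have been collected.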
[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg
Github user jorgelbg commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/78#discussion_r42168120

    --- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java ---
    @@ -0,0 +1,168 @@
    +/**
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements. See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License. You may obtain a copy of the License at
    + *
    + * http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.nutch.indexer.links;
    +
    +import org.apache.hadoop.conf.Configuration;
    +import org.apache.hadoop.io.Text;
    +import org.apache.nutch.crawl.CrawlDatum;
    +import org.apache.nutch.crawl.Inlink;
    +import org.apache.nutch.crawl.Inlinks;
    +import org.apache.nutch.indexer.IndexingException;
    +import org.apache.nutch.indexer.IndexingFilter;
    +import org.apache.nutch.indexer.NutchDocument;
    +import org.apache.nutch.parse.Outlink;
    +import org.apache.nutch.parse.Parse;
    +import org.slf4j.LoggerFactory;
    +
    +import java.net.MalformedURLException;
    +import java.net.URL;
    +import java.util.HashSet;
    +import java.util.Iterator;
    +import java.util.Set;
    +
    +/**
    + * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
    + * <code>outlinks</code> and <code>inlinks</code> field(s) to the document.
    + *
    + * In case that you want to ignore the outlinks that point to the same host
    + * as the URL being indexed use the following settings in your configuration
    + * file:
    + *
    + * <property>
    + *   <name>outlinks.host.ignore</name>
    + *   <value>true</value>
    + * </property>
    + *
    + * The same configuration is available for inlinks:
    + *
    + * <property>
    + *   <name>inlinks.host.ignore</name>
    + *   <value>true</value>
    + * </property>
    + *
    + * To store only the host portion of each inlink URL or outlink URL add the
    + * following to your configuration file.
    + *
    + * <property>
    + *   <name>links.hosts.only</name>
    + *   <value>false</value>
    + * </property>
    + *
    + */
    +public class LinksIndexingFilter implements IndexingFilter {
    +
    +  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
    +  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
    +  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
    +
    +  public final static org.slf4j.Logger LOG = LoggerFactory
    +      .getLogger(LinksIndexingFilter.class);
    +
    +  private Configuration conf;
    +  private boolean filterOutlinks;
    +  private boolean filterInlinks;
    +  private boolean indexHost;
    +
    +  @Override
    +  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
    +      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    +
    +    // Add the outlinks
    +    Outlink[] outlinks = parse.getData().getOutlinks();
    +
    +    try {
    +      if (outlinks != null) {
    +        Set hosts = new HashSet();
    +
    +        for (Outlink outlink : outlinks) {
    +          String linkUrl = outlink.getToUrl();
    +          String outHost = new URL(linkUrl).getHost();
    +
    +          if (indexHost) {
    +            linkUrl = new URL(outlink.getToUrl()).getHost();
    +
    +            if (hosts.contains(linkUrl))
    +              continue;
    +
    +            hosts.add(linkUrl);
    +          }
    +
    +          addFilteredLink("outlinks", url.toString(), linkUrl, outHost,
    +              filterOutlinks, doc);
    +        }
    +      }
    +    } catch (MalformedURLException e) {
    +      LOG.error("Malformed URL in {}: {}", url, e.getMessage());
    +    }
    +
    +    // Add the inlinks
    +    if (null != inlinks) {
    +      Iterator iterator = inlinks.iterator();
    +      Set inlinkHosts = new HashSet();
    +
    +      try {
    +        while (iterator.hasNext()) {
    --- End diff --

    I thought the same: since the URL has already been fetched there shouldn't be any trouble, but it's an easy fix, so I can put it inside the while loop.
[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg
Github user sebastian-nagel commented on a diff in the pull request:

    https://github.com/apache/nutch/pull/78#discussion_r42163943

    --- Diff: conf/nutch-default.xml ---
    @@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
    +
    --- End diff --

    By best practice, the 3 properties would share a common prefix, so that it's clear to which plugin they belong.
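A shared prefix for the three property keys could look like the sketch below. The class name LinkIndexingKeys and the prefix value `index.links.` are assumptions made for illustration (the prefix follows the `index.links.outlinks.host.ignore` candidate mentioned elsewhere in this thread); the pull request as quoted here still uses the unprefixed names.

```java
final class LinkIndexingKeys {
  /** Assumed common prefix tying every key to the index-links plugin. */
  static final String PREFIX = "index.links.";

  static final String OUTLINKS_HOST_IGNORE = PREFIX + "outlinks.host.ignore";
  static final String INLINKS_HOST_IGNORE = PREFIX + "inlinks.host.ignore";
  static final String HOSTS_ONLY = PREFIX + "hosts.only";

  public static void main(String[] args) {
    System.out.println(OUTLINKS_HOST_IGNORE);
    System.out.println(INLINKS_HOST_IGNORE);
    System.out.println(HOSTS_ONLY);
  }
}
```

Deriving each key from one constant keeps the prefix consistent if it is renamed later.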
[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks
[ https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959396#comment-14959396 ]

ASF GitHub Bot commented on NUTCH-2139:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42163943

--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
+
--- End diff --

As a best practice, the three properties should share a common prefix, so that it is clear which plugin they belong to.

> Basic plugin to index inlinks and outlinks
> --
>
> Key: NUTCH-2139
> URL: https://issues.apache.org/jira/browse/NUTCH-2139
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, plugin
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Minor
> Labels: link, plugin
> Fix For: 1.11
>
>
> Basic plugin that allows indexing the inlinks and outlinks of web pages;
> this could be very useful for analytics purposes, including neat
> visualizations using d3.js, for instance.
[GitHub] nutch pull request: Fix for NUTCH-2139 contributed by jorgelbg
Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42164537

--- Diff: conf/nutch-default.xml ---
@@ -1896,4 +1896,33 @@ CAUTION: Set the parser.timeout to -1 or a bigger value than 30, when using this
+
+<property>
+  <name>outlinks.host.ignore</name>
+  <value>false</value>
+  <description>
+Ignore the outlinks that point out to the same host as the URL being indexed.
+By default all outlinks are indexed.
+  </description>
+</property>
+
+<property>
+  <name>inlinks.host.ignore</name>
+  <value>false</value>
+  <description>
+Ignore the inlinks coming from the same host as the URL being indexed. By default
+all inlinks are indexed.
+  </description>
--- End diff --

If db.ignore.internal.links == true (the default), this option wouldn't change anything, because host-internal inlinks aren't even stored in the linkdb. It could be worth adding a reference to db.ignore.internal.links.
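The interaction described in this comment can be modelled with a small stand-alone sketch. It is a simplified illustration, not Nutch code: the class `InternalInlinksSketch` and its methods are hypothetical, and it assumes that both the linkdb stage and the index-time filter decide "internal" by comparing host names.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model, not Nutch code: both stages are assumed to compare hosts.
class InternalInlinksSketch {

  /** Extracts the host, turning a malformed URL into an unchecked error. */
  static String hostOf(String url) {
    try {
      return new URL(url).getHost();
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Malformed URL: " + url, e);
    }
  }

  /** Drops URLs on the same host as pageUrl, but only when enabled is true. */
  static List<String> dropSameHost(String pageUrl, List<String> urls,
      boolean enabled) {
    String pageHost = hostOf(pageUrl);
    List<String> kept = new ArrayList<>();
    for (String u : urls) {
      boolean sameHost = hostOf(u).equalsIgnoreCase(pageHost);
      if (!(enabled && sameHost)) {
        kept.add(u);
      }
    }
    return kept;
  }

  /** Applies the linkdb stage first, then the index-time stage. */
  static List<String> indexedInlinks(String pageUrl, List<String> inlinks,
      boolean dbIgnoreInternalLinks, boolean inlinksHostIgnore) {
    List<String> stored = dropSameHost(pageUrl, inlinks, dbIgnoreInternalLinks);
    return dropSameHost(pageUrl, stored, inlinksHostIgnore);
  }

  public static void main(String[] args) {
    String page = "http://example.org/page";
    List<String> inlinks = Arrays.asList(
        "http://example.org/other", "http://example.com/ref");
    // db.ignore.internal.links=true: the index-time option makes no difference.
    System.out.println(indexedInlinks(page, inlinks, true, false));
    System.out.println(indexedInlinks(page, inlinks, true, true));
    // db.ignore.internal.links=false: the index-time option now matters.
    System.out.println(indexedInlinks(page, inlinks, false, false));
    System.out.println(indexedInlinks(page, inlinks, false, true));
  }
}
```

Under these assumptions, the index-time flag only has a visible effect when internal links were kept in the link database in the first place, which is why a cross-reference in the property description would help users.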
[jira] [Commented] (NUTCH-2139) Basic plugin to index inlinks and outlinks
[ https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14959352#comment-14959352 ]

ASF GitHub Bot commented on NUTCH-2139:
---

Github user sebastian-nagel commented on a diff in the pull request:

https://github.com/apache/nutch/pull/78#discussion_r42161108

--- Diff: src/plugin/index-links/src/java/org/apache/nutch/indexer/links/LinksIndexingFilter.java ---
@@ -0,0 +1,168 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.indexer.links;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.crawl.Inlink;
+import org.apache.nutch.crawl.Inlinks;
+import org.apache.nutch.indexer.IndexingException;
+import org.apache.nutch.indexer.IndexingFilter;
+import org.apache.nutch.indexer.NutchDocument;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.slf4j.LoggerFactory;
+
+import java.net.MalformedURLException;
+import java.net.URL;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.Set;
+
+/**
+ * An {@link org.apache.nutch.indexer.IndexingFilter} that adds
+ * outlinks and inlinks field(s) to the document.
+ *
+ * In case that you want to ignore the outlinks that point to the same host
+ * as the URL being indexed use the following settings in your configuration
+ * file:
+ *
+ *
+ * outlinks.host.ignore
+ * true
+ *
+ *
+ * The same configuration is available for inlinks:
+ *
+ *
+ * inlinks.host.ignore
+ * true
+ *
+ *
+ * To store only the host portion of each inlink URL or outlink URL add the
+ * following to your configuration file.
+ *
+ *
+ * links.hosts.only
+ * false
+ *
+ *
+ */
+public class LinksIndexingFilter implements IndexingFilter {
+
+  public final static String LINKS_OUTLINKS_HOST = "outlinks.host.ignore";
+  public final static String LINKS_INLINKS_HOST = "inlinks.host.ignore";
+  public final static String LINKS_ONLY_HOSTS = "links.hosts.only";
+
+  public final static org.slf4j.Logger LOG = LoggerFactory
+      .getLogger(LinksIndexingFilter.class);
+
+  private Configuration conf;
+  private boolean filterOutlinks;
+  private boolean filterInlinks;
+  private boolean indexHost;
+
+  @Override
+  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
+      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
+
+    // Add the outlinks
+    Outlink[] outlinks = parse.getData().getOutlinks();
+
+    try {
+      if (outlinks != null) {
--- End diff --

See the comment below regarding the nesting of try-catch and loops: outlinks are not necessarily validated inside the parse data.

> Basic plugin to index inlinks and outlinks
> --
>
> Key: NUTCH-2139
> URL: https://issues.apache.org/jira/browse/NUTCH-2139
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, plugin
> Reporter: Jorge Luis Betancourt Gonzalez
> Priority: Minor
> Labels: link, plugin
> Fix For: 1.11
>
>
> Basic plugin that allows indexing the inlinks and outlinks of web pages;
> this could be very useful for analytics purposes, including neat
> visualizations using d3.js, for instance.
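The nesting concern raised in this review can be illustrated with a stand-alone sketch. When the try-catch wraps the whole loop, the first malformed URL aborts processing of every remaining link; when it sits inside the loop, only the offending link is skipped. The class `PerLinkValidationSketch` and its method `hostsOf` are hypothetical and use plain strings instead of Nutch's `Outlink` objects, so this is an illustration under those assumptions rather than the plugin's implementation.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; it does not use any Nutch classes.
class PerLinkValidationSketch {

  /** Returns the hosts of all well-formed URLs and skips malformed ones. */
  static List<String> hostsOf(List<String> linkUrls) {
    List<String> hosts = new ArrayList<>();
    for (String linkUrl : linkUrls) {
      try {
        // The try-catch is inside the loop, so one bad URL affects one link.
        hosts.add(new URL(linkUrl).getHost());
      } catch (MalformedURLException e) {
        System.err.println("Skipping malformed URL " + linkUrl + ": "
            + e.getMessage());
      }
    }
    return hosts;
  }

  public static void main(String[] args) {
    List<String> links = Arrays.asList(
        "http://example.org/a", "not a url", "http://example.com/b");
    // The malformed middle entry is skipped; the other two hosts are kept.
    System.out.println(hostsOf(links));
  }
}
```

With the try-catch outside the loop, the same input would stop at the second entry and the third link would never be processed.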