[jira] [Updated] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
[ https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thamme Gowda N updated NUTCH-2144: -- Attachment: ignore-exempt.patch Patch supplied. Summary of changes: * A new plugin extension point is added: "URLExemptionFilter" * A new plugin is added: "urlfilter-ignoreexempt" * A new conf file is added > Plugin to override db.ignore.external to exempt interesting external domain URLs > Key: NUTCH-2144 > URL: https://issues.apache.org/jira/browse/NUTCH-2144 > Project: Nutch > Issue Type: New Feature > Components: crawldb, fetcher > Reporter: Thamme Gowda N > Priority: Minor > Attachments: ignore-exempt.patch > Create a rule-based urlfilter plugin that allows a focused crawler (db.ignore.external.links=true) to fetch static resources from external domains. > The generalized version of this: the plugin should permit interesting URLs from external domains (by overriding db.ignore.external). The interesting URLs are decided by a combination of regex and MIME-type rules. > Concrete use case: When using Nutch to crawl images from a set of domains, the crawler needs to fetch all images, which may be linked from CDNs and other domains. In this scenario, allowing all external links and then writing hundreds of regular expressions is not feasible for a large number of domains. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs
Thamme Gowda N created NUTCH-2144: - Summary: Plugin to override db.ignore.external to exempt interesting external domain URLs Key: NUTCH-2144 URL: https://issues.apache.org/jira/browse/NUTCH-2144 Project: Nutch Issue Type: New Feature Components: crawldb, fetcher Reporter: Thamme Gowda N Priority: Minor Create a rule-based urlfilter plugin that allows a focused crawler (db.ignore.external.links=true) to fetch static resources from external domains. The generalized version of this: the plugin should permit interesting URLs from external domains (by overriding db.ignore.external). The interesting URLs are decided by a combination of regex and MIME-type rules. Concrete use case: When using Nutch to crawl images from a set of domains, the crawler needs to fetch all images, which may be linked from CDNs and other domains. In this scenario, allowing all external links and then writing hundreds of regular expressions is not feasible for a large number of domains. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
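As an illustration of the rule idea in NUTCH-2144 (not the code in ignore-exempt.patch), a minimal Java sketch might pair a URL regex with a MIME-type regex and exempt an external URL when any pair matches. The names ExemptionRules, Rule, and isExempt are hypothetical, and the sketch assumes the caller already has a MIME type for the URL (for example, guessed from the file extension).

```java
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical rule set for illustration; not the API from ignore-exempt.patch. */
final class ExemptionRules {

  /** One rule: the URL must match urlPattern and the MIME type must match mimePattern. */
  static final class Rule {
    final Pattern urlPattern;
    final Pattern mimePattern;

    Rule(String urlRegex, String mimeRegex) {
      this.urlPattern = Pattern.compile(urlRegex);
      this.mimePattern = Pattern.compile(mimeRegex);
    }

    boolean matches(String url, String mimeType) {
      return urlPattern.matcher(url).matches()
          && mimePattern.matcher(mimeType).matches();
    }
  }

  private final List<Rule> rules;

  ExemptionRules(List<Rule> rules) {
    this.rules = rules;
  }

  /** Returns true if any rule exempts this external URL from being ignored. */
  boolean isExempt(String url, String mimeType) {
    if (url == null || mimeType == null) {
      return false;
    }
    for (Rule rule : rules) {
      if (rule.matches(url, mimeType)) {
        return true;
      }
    }
    return false;
  }
}
```

With a single rule of `.*\.(jpg|png)$` and `image/.*`, an image on a CDN host would be exempted, while an HTML page on the same host would still be ignored.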
[jira] [Commented] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error
[ https://issues.apache.org/jira/browse/NUTCH-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962592#comment-14962592 ] Hudson commented on NUTCH-2142: --- SUCCESS: Integrated in Nutch-trunk #3291 (See [https://builds.apache.org/job/Nutch-trunk/3291/]) Fix for NUTCH-2142: Nutch File Dump - FileNotFoundException (Invalid Argument) Error contributed by karanjeets this closes #76. (mattmann: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1709304]) * trunk/CHANGES.txt * trunk/src/java/org/apache/nutch/tools/FileDumper.java * trunk/src/java/org/apache/nutch/util/DumpFileUtil.java > Nutch File Dump - FileNotFoundException (Invalid Argument) Error > Key: NUTCH-2142 > URL: https://issues.apache.org/jira/browse/NUTCH-2142 > Project: Nutch > Issue Type: Bug > Components: tool, util > Affects Versions: 1.10, 1.11 > Environment: Operating System - Linux (RHEL 6.2) > Reporter: Karanjeet Singh > Assignee: Chris A. Mattmann > Labels: dump, nutch > Fix For: 1.11 > Original Estimate: 4h > Remaining Estimate: 4h > Got *FileNotFoundException* while running nutch dump. > *Cause*: Character '?' in file name/extension producing the below error. > *Error Details* > java.io.FileNotFoundException: /media/PATRO/Karan/nutch_12Oct/other_gun_urls/img/99/fb/97d3980f9954b597f372d092b97eff22_27tlt_recon_1_black_g_10_handle_.jpeg? (Invalid argument) > at java.io.FileOutputStream.open(Native Method) > at java.io.FileOutputStream.<init>(FileOutputStream.java:221) > at java.io.FileOutputStream.<init>(FileOutputStream.java:171) > at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:222) > at org.apache.nutch.tools.FileDumper.main(FileDumper.java:325) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
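One way to avoid this class of failure is to sanitize the name before opening the output file. The sketch below is an assumption about the general approach, not the change committed in r1709304; DumpFileNames, sanitize, and stripQuery are hypothetical names, and the set of replaced characters is a conservative guess at what common filesystems reject.

```java
/** Hypothetical helper for illustration; not the code in FileDumper or DumpFileUtil. */
final class DumpFileNames {

  private DumpFileNames() {
  }

  /** Replaces characters that some filesystems reject in file names with '_'. */
  static String sanitize(String name) {
    if (name == null || name.isEmpty()) {
      return "_";
    }
    // Backslash, slash, colon, asterisk, question mark, quote, angle brackets,
    // pipe, and ASCII control characters.
    return name.replaceAll("[\\\\/:*?\"<>|\\x00-\\x1F]", "_");
  }

  /** Cuts a URL at the first '?' or '#', so an extension can be derived from the path only. */
  static String stripQuery(String url) {
    int cut = url.length();
    int query = url.indexOf('?');
    int fragment = url.indexOf('#');
    if (query >= 0) {
      cut = Math.min(cut, query);
    }
    if (fragment >= 0) {
      cut = Math.min(cut, fragment);
    }
    return url.substring(0, cut);
  }
}
```

Applying stripQuery before the extension is extracted would drop the trailing '?' seen in the stack trace; sanitize is a second line of defence for whatever characters remain.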
[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962594#comment-14962594 ] Hudson commented on NUTCH-2129: --- SUCCESS: Integrated in Nutch-trunk #3291 (See [https://builds.apache.org/job/Nutch-trunk/3291/]) Fix for NUTCH-2129 - Add protocol status tracking to crawl datum contributed by Michael Joyce this closes #68. (mattmann: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1709306]) * trunk/CHANGES.txt * trunk/src/java/org/apache/nutch/metadata/Nutch.java * trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java * trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java > Track Protocol Status in Crawl Datum > > > Key: NUTCH-2129 > URL: https://issues.apache.org/jira/browse/NUTCH-2129 > Project: Nutch > Issue Type: Improvement >Affects Versions: 2.3, 1.10 >Reporter: Michael Joyce >Assignee: Chris A. Mattmann > Fix For: 2.4, 1.11 > > > It's become necessary on a few crawls that I run to get protocol status code > stats. After speaking with [~lewismc] it seemed that there might not be a > super convenient way of doing this as is, but it would be great to be able to > add the functionality necessary to pull this information out. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content
[ https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962593#comment-14962593 ] Hudson commented on NUTCH-2141: --- SUCCESS: Integrated in Nutch-trunk #3291 (See [https://builds.apache.org/job/Nutch-trunk/3291/]) Fix for NUTCH-2141: Change the InteractiveSelenium plugin handler Interface to return page content contributed by Balaji this closes #77 #75 (mattmann: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1709307]) * trunk/CHANGES.txt * trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java * trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java * trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultClickAllAjaxLinksHandler.java * trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java * trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java > Change the InteractiveSelenium plugin handler Interface to return page content > Key: NUTCH-2141 > URL: https://issues.apache.org/jira/browse/NUTCH-2141 > Project: Nutch > Issue Type: Improvement > Components: plugin > Reporter: Balaji Gurumurthy > Assignee: Chris A. Mattmann > Labels: selenium > Fix For: 1.11 > The handler interface in the protocol-interactiveselenium plugin currently provides methods to manipulate the page content, and the HttpResponse class reads the page content from the driver. This limits the amount of HTML content that could be returned to Nutch. > The processDriver method could return a String object instead. This is particularly helpful in cases such as handling pagination, when multiple pages' content can be appended and returned from the handler. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
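A rough sketch of the proposed contract follows, in which the handler accumulates content across pages and returns it. PageDriver is a stand-in so the example compiles without Selenium; ContentReturningHandler, PaginationHandler, clickNext, and maxPages are hypothetical names, and the actual signature in InteractiveSeleniumHandler may differ.

```java
/** Stand-in for a browser driver; a real handler would receive a Selenium WebDriver. */
interface PageDriver {
  /** Returns the HTML of the page currently loaded. */
  String getPageSource();

  /** Moves to the next page; returns false when there is no further page. */
  boolean clickNext();
}

/** Hypothetical handler contract in which processDriver returns the content it collected. */
interface ContentReturningHandler {
  String processDriver(PageDriver driver);
}

/** Appends the source of each page it visits, up to maxPages pages. */
final class PaginationHandler implements ContentReturningHandler {
  private final int maxPages;

  PaginationHandler(int maxPages) {
    this.maxPages = maxPages;
  }

  @Override
  public String processDriver(PageDriver driver) {
    StringBuilder content = new StringBuilder();
    int pages = 0;
    do {
      content.append(driver.getPageSource());
      pages++;
      // Only click through while the page budget allows another page.
    } while (pages < maxPages && driver.clickNext());
    return content.toString();
  }
}
```

Because the handler returns the string itself, the caller no longer has to read a single page source from the driver after the handler finishes, which is the limitation the issue describes.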
[DISCUSS] Release 1.11 RC #1 (70 issues fixed)
Hey Folks,

I’ll cut a 1.11 RC #1 today. We have 70 issues fixed, and I think it would be a great time to release. Going to try for a Tika 1.11 release candidate 1 today too.

Cheers,
Chris

Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW: http://sunset.usc.edu/~mattmann/

Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
[jira] [Updated] (NUTCH-2133) Transfer Selenium Documentation to Wiki
[ https://issues.apache.org/jira/browse/NUTCH-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2133: - Fix Version/s: (was: 1.11) (was: 2.4) 1.12 > Transfer Selenium Documentation to Wiki > Key: NUTCH-2133 > URL: https://issues.apache.org/jira/browse/NUTCH-2133 > Project: Nutch > Issue Type: Improvement > Components: documentation > Affects Versions: 2.3, 1.10 > Reporter: Michael Joyce > Fix For: 1.12 > There's a decent chunk of Selenium-related documentation stuck in READMEs for various plugins. It would be nice to get this stuff pushed to the wiki. > E.g.: https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-selenium/README.md -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2030) ParseZip plugin is not able to extract language from zip document, this could solve that problem.
[ https://issues.apache.org/jira/browse/NUTCH-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2030: - Fix Version/s: (was: 1.11) 1.12 > ParseZip plugin is not able to extract language from zip document, this could solve that problem. > Key: NUTCH-2030 > URL: https://issues.apache.org/jira/browse/NUTCH-2030 > Project: Nutch > Issue Type: Improvement > Components: parser, plugin > Environment: Linux Mint 17 Qiana, 4 GB RAM, Core i3. > Reporter: Eyeris Rodriguez Rueda > Priority: Minor > Fix For: 1.12 > Original Estimate: 336h > Remaining Estimate: 336h > Currently the parse-zip plugin doesn't extract the language from zip documents, so the lang field is empty in Solr or Elasticsearch. If the package (.zip) contains a list of documents, the lang field could be multivalued to support that list of languages. A simple change to the parse-zip plugin could fix this problem. I will use the LanguageIdentifier class from Tika and analyze each document inside. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2086) Nutch 1.X Webui
[ https://issues.apache.org/jira/browse/NUTCH-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2086: - Fix Version/s: (was: 1.11) 1.12 > Nutch 1.X Webui > > > Key: NUTCH-2086 > URL: https://issues.apache.org/jira/browse/NUTCH-2086 > Project: Nutch > Issue Type: New Feature > Components: REST_api, web gui >Reporter: Sujen Shah >Assignee: Chris A. Mattmann > Labels: memex > Fix For: 1.12 > > Attachments: NUTCH-2086.patch > > > To port the Apache Wicket based webui in Nutch 2.X to 1.X -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events
[ https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2132: - Fix Version/s: (was: 1.11) 1.12 > Publisher/Subscriber model for Nutch to emit events > Key: NUTCH-2132 > URL: https://issues.apache.org/jira/browse/NUTCH-2132 > Project: Nutch > Issue Type: New Feature > Components: fetcher, REST_api > Reporter: Sujen Shah > Labels: memex > Fix For: 1.12 > Attachments: NUTCH-2132.patch > It would be nice to have a Pub/Sub model in Nutch to emit certain events (e.g. fetcher events such as fetch-start and fetch-end, or a fetch report which may contain data like the outlinks of the currently fetched URL, score, etc.). > A consumer of this functionality could use this data to generate real-time visualizations and statistics of the crawl without having to wait for the fetch round to finish. > The REST API could contain an endpoint which would respond with a URL to which a client could subscribe to get the fetcher events. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
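A minimal in-process sketch of the publish/subscribe idea is shown below. It only illustrates the pattern; NUTCH-2132.patch may use a different design (for instance an external message broker), and FetchEvent and FetchEventBus are hypothetical names.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

/** Hypothetical event payload: an event type such as "fetch-start" plus the URL concerned. */
final class FetchEvent {
  final String type;
  final String url;

  FetchEvent(String type, String url) {
    this.type = type;
    this.url = url;
  }
}

/** Hypothetical in-process bus: every published event is delivered to every subscriber. */
final class FetchEventBus {
  // CopyOnWriteArrayList lets subscribers be added while another thread is publishing.
  private final List<Consumer<FetchEvent>> subscribers = new CopyOnWriteArrayList<>();

  void subscribe(Consumer<FetchEvent> subscriber) {
    subscribers.add(subscriber);
  }

  void publish(FetchEvent event) {
    for (Consumer<FetchEvent> subscriber : subscribers) {
      subscriber.accept(event);
    }
  }
}
```

A fetcher thread would call publish at the start and end of each fetch, and a consumer such as a statistics collector would register once through subscribe, so it sees events as they happen rather than after the fetch round.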
[jira] [Updated] (NUTCH-2140) Atomic update and optimistic concurrency update using Solr
[ https://issues.apache.org/jira/browse/NUTCH-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2140: - Fix Version/s: (was: 1.11) 1.12 > Atomic update and optimistic concurrency update using Solr > Key: NUTCH-2140 > URL: https://issues.apache.org/jira/browse/NUTCH-2140 > Project: Nutch > Issue Type: Improvement > Components: indexer, plugin > Affects Versions: 1.9 > Reporter: Roannel Fernández Hernández > Fix For: 1.12 > The SOLRIndexWriter plugin allows documents to be indexed into a Solr server. The plugin replaces documents that are already indexed in Solr. Sometimes it is useful to replace only one field, or to add new fields, while keeping the other values of the indexed documents. > Solr supports two approaches for this task: atomic update and optimistic concurrency update. However, the SOLRIndexWriter plugin doesn't support those approaches. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
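For reference, a Solr atomic update sends the document key plus a modifier map such as {"set": value} for each changed field, and optimistic concurrency adds the expected _version_ value. The sketch below only builds that structure with plain Java maps; the class and method names are hypothetical, it assumes the unique key field is named id, and it is not how SOLRIndexWriter works today.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper that builds the shape of a Solr atomic-update document. */
final class AtomicUpdateSketch {

  private AtomicUpdateSketch() {
  }

  /**
   * Builds an update that sets one field and leaves the others untouched.
   * A non-zero expectedVersion is included as _version_ for optimistic concurrency;
   * zero means no version check is requested.
   */
  static Map<String, Object> setField(String id, String field, Object value,
      long expectedVersion) {
    Map<String, Object> doc = new LinkedHashMap<>();
    doc.put("id", id); // assumption: the schema's unique key is "id"
    doc.put(field, Collections.singletonMap("set", value));
    if (expectedVersion != 0L) {
      doc.put("_version_", expectedVersion);
    }
    return doc;
  }
}
```

With SolrJ, a map like the inner one can be used as a field value on a SolrInputDocument; when a positive _version_ is supplied, Solr is expected to reject the update with a version conflict if the stored version differs.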
[jira] [Updated] (NUTCH-2120) Remove MapWritable from trunk codebase
[ https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2120: - Fix Version/s: (was: 1.11) 1.12 > Remove MapWritable from trunk codebase > -- > > Key: NUTCH-2120 > URL: https://issues.apache.org/jira/browse/NUTCH-2120 > Project: Nutch > Issue Type: Bug >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney >Priority: Trivial > Fix For: 1.12 > > > [MapWritable|http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/MapWritable.htm] > has been deprecated for a good while. > We should remove it from the codebase and make sure we are not using it > anywhere (I don't think we are). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2139) Basic plugin to index inlinks and outlinks
[ https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2139: - Fix Version/s: (was: 1.11) 1.12 > Basic plugin to index inlinks and outlinks > Key: NUTCH-2139 > URL: https://issues.apache.org/jira/browse/NUTCH-2139 > Project: Nutch > Issue Type: Improvement > Components: indexer, plugin > Reporter: Jorge Luis Betancourt Gonzalez > Priority: Minor > Labels: link, plugin > Fix For: 1.12 > Basic plugin that allows indexing the inlinks and outlinks of web pages; this could be very useful for analytics purposes, including neat visualizations using d3.js, for instance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2122) Implement Javadoc package.html for service packages
[ https://issues.apache.org/jira/browse/NUTCH-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2122: - Fix Version/s: (was: 1.11) 1.12 > Implement Javadoc package.html for service packages > --- > > Key: NUTCH-2122 > URL: https://issues.apache.org/jira/browse/NUTCH-2122 > Project: Nutch > Issue Type: Improvement > Components: nutch server >Affects Versions: 1.10 >Reporter: Lewis John McGibbney >Priority: Trivial > Fix For: 1.12 > > > [~sujenshah] I noticed that the Javadoc does not contain package.html > displaying package level introductory Javadoc as every other package does. > http://nutch.apache.org/apidocs/apidocs-1.10/index.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2135) Ant Eclipse build does not include protocol-interactiveselenium
[ https://issues.apache.org/jira/browse/NUTCH-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2135: - Fix Version/s: (was: 1.11) 1.12 > Ant Eclipse build does not include protocol-interactiveselenium > Key: NUTCH-2135 > URL: https://issues.apache.org/jira/browse/NUTCH-2135 > Project: Nutch > Issue Type: Improvement > Components: protocol > Reporter: Sujen Shah > Priority: Minor > Labels: memex > Fix For: 1.12 > The eclipse target in the build.xml file does not include protocol-interactiveselenium, so when importing the project into Eclipse, that folder is not added. > On adding it to the build file, I found that Eclipse throws errors because the package naming in classes belonging to org.apache.nutch.protocol.interactiveselenium.handlers is incomplete. > I have made both of those changes in this PR. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2128) Refactor configuration end point
[ https://issues.apache.org/jira/browse/NUTCH-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2128: - Fix Version/s: (was: 1.11) 1.12 > Refactor configuration end point > > > Key: NUTCH-2128 > URL: https://issues.apache.org/jira/browse/NUTCH-2128 > Project: Nutch > Issue Type: Sub-task > Components: REST_api >Reporter: Sujen Shah >Assignee: Sujen Shah >Priority: Minor > Fix For: 1.12 > > > To better define the endpoint to create a new configuration and add a new > endpoint to update a particular property value of a configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-1943) Form authentication should not be global and ignore
[ https://issues.apache.org/jira/browse/NUTCH-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-1943: - Fix Version/s: (was: 1.11) 1.12 > Form authentication should not be global and ignore > --- > > Key: NUTCH-1943 > URL: https://issues.apache.org/jira/browse/NUTCH-1943 > Project: Nutch > Issue Type: Improvement > Components: plugin, protocol >Affects Versions: 1.10 >Reporter: Lewis John McGibbney >Assignee: Lewis John McGibbney > Labels: memex > Fix For: 1.12 > > > Taken from [~wastl-nagel]'s comments on NUTCH-827 > bq. the form authentication is global and ignores . So you have to > restrict your crawl to the form authentication pages only. Ideally, also form > authentication should be bound to a scope (one host, one URL prefix, etc.) > same as HTTP authentication. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters
[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2064: - Fix Version/s: (was: 1.11) 1.12 > URLNormalizer basic to properly encode non-ASCII characters > --- > > Key: NUTCH-2064 > URL: https://issues.apache.org/jira/browse/NUTCH-2064 > Project: Nutch > Issue Type: Bug >Affects Versions: 1.10 >Reporter: Markus Jelsma > Fix For: 1.12 > > Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch, > NUTCH-2064.patch > > > NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content
[ https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-2141. -- Resolution: Fixed Fix Version/s: 1.11 Thanks [~BalaJira] [~jo...@apache.org] plenty to improve on but a great start! {noformat} [chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2141: Change the InteractiveSelenium plugin handler Interface to return page content contributed by Balaji this closes #77 #75" Sending CHANGES.txt Sending src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java Sending src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java Sending src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultClickAllAjaxLinksHandler.java Sending src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java Sending src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java Transmitting file data .. Committed revision 1709307. [chipotle:~/tmp/nutch1.11] mattmann% {noformat}
[GitHub] nutch pull request: Trunk
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/75 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content
[ https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962574#comment-14962574 ] ASF GitHub Bot commented on NUTCH-2141: --- Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/77
[GitHub] nutch pull request: fix for NUTCH-2141 contributed by Balaji Gurum...
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/77
[jira] [Work started] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content
[ https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2141 started by Chris A. Mattmann.
[jira] [Resolved] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-2129. -- Resolution: Fixed Thanks [~jo...@apache.org]! {noformat} [chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2129 - Add protocol status tracking to crawl datum contributed by Michael Joyce this closes #68." Sending CHANGES.txt Sending src/java/org/apache/nutch/metadata/Nutch.java Sending src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java Sending src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java Transmitting file data Committed revision 1709306. {noformat}
[jira] [Assigned] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content
[ https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-2141: Assignee: Chris A. Mattmann
[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962572#comment-14962572 ] ASF GitHub Bot commented on NUTCH-2129: --- Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/68
[GitHub] nutch pull request: NUTCH-2129 - Add protocol status tracking to c...
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/68
[jira] [Work started] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2129 started by Chris A. Mattmann.
[jira] [Resolved] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error
[ https://issues.apache.org/jira/browse/NUTCH-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-2142. -- Resolution: Fixed Thanks [~karanjeets]! {noformat} [chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2142: Nutch File Dump - FileNotFoundException (Invalid Argument) Error contributed by karanjeets this closes #76." Sending CHANGES.txt Sending src/java/org/apache/nutch/tools/FileDumper.java Sending src/java/org/apache/nutch/util/DumpFileUtil.java Transmitting file data ... Committed revision 1709304. [chipotle:~/tmp/nutch1.11] mattmann% {noformat}
[jira] [Assigned] (NUTCH-2129) Track Protocol Status in Crawl Datum
[ https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-2129: Assignee: Chris A. Mattmann
[GitHub] nutch pull request: Fixed FileNotFoundException (Invalid Argument)...
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/76
[jira] [Work started] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error
[ https://issues.apache.org/jira/browse/NUTCH-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-2142 started by Chris A. Mattmann.
[jira] [Assigned] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error
[ https://issues.apache.org/jira/browse/NUTCH-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reassigned NUTCH-2142:

    Assignee: Chris A. Mattmann
[jira] [Commented] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error
[ https://issues.apache.org/jira/browse/NUTCH-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962295#comment-14962295 ]

Karanjeet Singh commented on NUTCH-2142:

This has been completed under GitHub pull request (https://github.com/apache/nutch/pull/76)