[jira] [Updated] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2015-10-18 Thread Thamme Gowda N (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thamme Gowda N updated NUTCH-2144:
--
Attachment: ignore-exempt.patch

Patch supplied.

Summary of changes:
* A new plugin extension point is added: "URLExemptionFilter"
* A new plugin is added "urlfilter-ignoreexempt"
* A new conf file is added


> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Priority: Minor
> Attachments: ignore-exempt.patch
>
>
> Create a rule-based urlfilter plugin that allows a focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: the plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external.links). The 
> interesting URLs are decided by a combination of regex and MIME-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images, which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for a large number of domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2015-10-18 Thread Thamme Gowda N (JIRA)
Thamme Gowda N created NUTCH-2144:
-

 Summary: Plugin to override db.ignore.external to exempt 
interesting external domain URLs
 Key: NUTCH-2144
 URL: https://issues.apache.org/jira/browse/NUTCH-2144
 Project: Nutch
  Issue Type: New Feature
  Components: crawldb, fetcher
Reporter: Thamme Gowda N
Priority: Minor


Create a rule-based urlfilter plugin that allows a focused crawler 
(db.ignore.external.links=true) to fetch static resources from external domains.
The generalized version of this: the plugin should permit interesting URLs 
from external domains (by overriding db.ignore.external.links). The interesting 
URLs are decided by a combination of regex and MIME-type rules.


Concrete use case:
  When using Nutch to crawl images from a set of domains, the crawler needs to 
fetch all images, which may be linked from CDNs and other domains. In this 
scenario, allowing all external links and then writing hundreds of regular 
expressions is not feasible for a large number of domains.
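
A minimal sketch of how one exemption rule might combine a MIME-type pattern
with a URL regex. The class name ExemptionRule, the methods isExempt and
guessMime, and the small extension table are hypothetical and are not taken
from the attached patch; the MIME type is guessed from the file extension only
to keep the example self-contained.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the code in ignore-exempt.patch.
class ExemptionRule {
    // Tiny extension-to-MIME table, assumed for this example.
    private static final Map<String, String> MIME_BY_EXT = new HashMap<>();
    static {
        MIME_BY_EXT.put("jpg", "image/jpeg");
        MIME_BY_EXT.put("jpeg", "image/jpeg");
        MIME_BY_EXT.put("png", "image/png");
        MIME_BY_EXT.put("gif", "image/gif");
        MIME_BY_EXT.put("html", "text/html");
    }

    private final Pattern mimePattern; // which MIME types are interesting
    private final Pattern urlPattern;  // which URLs are interesting

    ExemptionRule(String mimeRegex, String urlRegex) {
        this.mimePattern = Pattern.compile(mimeRegex);
        this.urlPattern = Pattern.compile(urlRegex);
    }

    // Guess a MIME type from the extension, ignoring any query string.
    static String guessMime(String url) {
        String path = url;
        int q = path.indexOf('?');
        if (q >= 0) {
            path = path.substring(0, q);
        }
        int dot = path.lastIndexOf('.');
        if (dot < 0) {
            return "application/octet-stream";
        }
        String ext = path.substring(dot + 1).toLowerCase(Locale.ROOT);
        String mime = MIME_BY_EXT.get(ext);
        return mime != null ? mime : "application/octet-stream";
    }

    // An external URL is exempted only if both the MIME rule and the URL rule match.
    boolean isExempt(String toUrl) {
        return mimePattern.matcher(guessMime(toUrl)).matches()
            && urlPattern.matcher(toUrl).matches();
    }
}
```

With a rule such as ("^image/.*", "^https?://.*\\.(jpe?g|png|gif)(\\?.*)?$"),
an image on a CDN would be exempted while an external HTML page would still be
ignored, so one rule can cover many domains.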





[jira] [Commented] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error

2015-10-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962592#comment-14962592
 ] 

Hudson commented on NUTCH-2142:
---

SUCCESS: Integrated in Nutch-trunk #3291 (See 
[https://builds.apache.org/job/Nutch-trunk/3291/])
Fix for NUTCH-2142: Nutch File Dump - FileNotFoundException (Invalid Argument) 
Error contributed by karanjeets  this closes #76. 
(mattmann: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1709304])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/tools/FileDumper.java
* trunk/src/java/org/apache/nutch/util/DumpFileUtil.java


> Nutch File Dump - FileNotFoundException (Invalid Argument) Error
> 
>
> Key: NUTCH-2142
> URL: https://issues.apache.org/jira/browse/NUTCH-2142
> Project: Nutch
>  Issue Type: Bug
>  Components: tool, util
>Affects Versions: 1.10, 1.11
> Environment: Operating System - Linux (RHEL 6.2)
>Reporter: Karanjeet Singh
>Assignee: Chris A. Mattmann
>  Labels: dump, nutch
> Fix For: 1.11
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> Got *FileNotFoundException* while running nutch dump.
> *Cause*: Character '?' in file name/extension producing the below error.
> *Error Details*
> java.io.FileNotFoundException: 
> /media/PATRO/Karan/nutch_12Oct/other_gun_urls/img/99/fb/97d3980f9954b597f372d092b97eff22_27tlt_recon_1_black_g_10_handle_.jpeg?
>  (Invalid argument)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.(FileOutputStream.java:221)
> at java.io.FileOutputStream.(FileOutputStream.java:171)
> at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:222)
> at org.apache.nutch.tools.FileDumper.main(FileDumper.java:325)
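
One possible workaround, sketched under assumptions: replace every character
outside a conservative allow-list before the dump file is opened, so that a
trailing '?' carried over from the URL never reaches the file system. The
class DumpNameSketch and the helper sanitize are hypothetical and are not the
change committed to FileDumper.java or DumpFileUtil.java.

```java
// Hypothetical illustration only; not the committed fix for NUTCH-2142.
class DumpNameSketch {
    // Replace every character outside [A-Za-z0-9._-] with an underscore.
    static String sanitize(String fileName) {
        return fileName.replaceAll("[^A-Za-z0-9._-]", "_");
    }
}
```

An allow-list is used rather than a single replacement of '?' because other
characters that are legal in URLs can be just as troublesome in file names on
some file systems.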





[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962594#comment-14962594
 ] 

Hudson commented on NUTCH-2129:
---

SUCCESS: Integrated in Nutch-trunk #3291 (See 
[https://builds.apache.org/job/Nutch-trunk/3291/])
Fix for NUTCH-2129 - Add protocol status tracking to crawl datum contributed by 
Michael Joyce  this closes #68. (mattmann: 
[http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1709306])
* trunk/CHANGES.txt
* trunk/src/java/org/apache/nutch/metadata/Nutch.java
* trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
* trunk/src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java


> Track Protocol Status in Crawl Datum
> 
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.





[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-18 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962593#comment-14962593
 ] 

Hudson commented on NUTCH-2141:
---

SUCCESS: Integrated in Nutch-trunk #3291 (See 
[https://builds.apache.org/job/Nutch-trunk/3291/])
Fix for NUTCH-2141: Change the InteractiveSelenium plugin handler Interface to 
return page content contributed by Balaji  this closes #77 
#75 (mattmann: [http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1709307])
* trunk/CHANGES.txt
* trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java
* trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java
* trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultClickAllAjaxLinksHandler.java
* trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java
* trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java


> Change the InteractiveSelenium plugin handler Interface to return page content
> --
>
> Key: NUTCH-2141
> URL: https://issues.apache.org/jira/browse/NUTCH-2141
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Balaji Gurumurthy
>Assignee: Chris A. Mattmann
>  Labels: selenium
> Fix For: 1.11
>
>
> The handler interface in the protocol-interactiveselenium plugin currently 
> provides methods to manipulate the page content, and the HttpResponse class 
> reads the page content from the driver. This limits the amount of HTML 
> content that can be returned to Nutch.
> The processDriver method could return a String object instead. This is 
> particularly helpful in cases such as handling pagination, where multiple 
> pages' content can be appended and returned from the handler.
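
The proposal can be sketched roughly as follows. Driver is a hypothetical
stand-in for the Selenium driver so that the example stays self-contained, and
ContentHandler, PaginationHandler and FakeDriver are invented names; only
processDriver comes from the description, and its signature here is an
assumption.

```java
import java.util.List;

// Hypothetical stand-in for a browser driver; not the Selenium API.
interface Driver {
    String getPageSource(); // HTML of the page currently loaded
    boolean nextPage();     // move to the next page; false when none is left
}

// Assumed shape of the handler once processDriver returns the content itself.
interface ContentHandler {
    String processDriver(Driver driver);
}

// Appends the HTML of every page reached through pagination.
class PaginationHandler implements ContentHandler {
    @Override
    public String processDriver(Driver driver) {
        StringBuilder content = new StringBuilder(driver.getPageSource());
        while (driver.nextPage()) {
            content.append(driver.getPageSource());
        }
        return content.toString();
    }
}

// In-memory driver used only to exercise the sketch.
class FakeDriver implements Driver {
    private final List<String> pages;
    private int index = 0;

    FakeDriver(List<String> pages) {
        this.pages = pages;
    }

    @Override
    public String getPageSource() {
        return pages.get(index);
    }

    @Override
    public boolean nextPage() {
        if (index + 1 >= pages.size()) {
            return false;
        }
        index++;
        return true;
    }
}
```

Because the handler returns the accumulated String, the caller no longer has
to read a single page from the driver after the handler has finished.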





[DISCUSS] Release 1.11 RC #1 (70 issues fixed)

2015-10-18 Thread Mattmann, Chris A (3980)
Hey Folks,

I’ll cut a 1.11 RC #1 today. We have 70 issues fixed, and I think
it would be a great time to release.

Going to try for a Tika 1.11 release candidate 1 today too.

Cheers,
Chris

++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: chris.a.mattm...@nasa.gov
WWW:  http://sunset.usc.edu/~mattmann/
++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++





[jira] [Updated] (NUTCH-2133) Transfer Selenium Documentation to WIki

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2133:
-
Fix Version/s: (was: 1.11)
   (was: 2.4)
   1.12

> Transfer Selenium Documentation to WIki
> ---
>
> Key: NUTCH-2133
> URL: https://issues.apache.org/jira/browse/NUTCH-2133
> Project: Nutch
>  Issue Type: Improvement
>  Components: documentation
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
> Fix For: 1.12
>
>
> There's a decent chunk of Selenium-related documentation stuck in READMEs for 
> various plugins. It would be nice to get this stuff pushed to the wiki.
> E.G.: 
> https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-selenium/README.md





[jira] [Updated] (NUTCH-2030) ParseZip plugin is not able to extract language from zip document,this could solve that problem.

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2030:
-
Fix Version/s: (was: 1.11)
   1.12

> ParseZip plugin is not able to extract language from zip document,this could 
> solve that problem.
> 
>
> Key: NUTCH-2030
> URL: https://issues.apache.org/jira/browse/NUTCH-2030
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
> Environment: Linux Mint 17 qiana, 4 GB Ram,Core I3.
>Reporter: Eyeris Rodriguez Rueda
>Priority: Minor
> Fix For: 1.12
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> Currently the parse-zip plugin doesn't extract the language from zip 
> documents, so the lang field is empty in Solr or Elasticsearch. If the 
> package (.zip) contains a list of documents, the lang field could be 
> multivalued to support that list of languages. A simple change to the 
> parse-zip plugin could fix this problem. I will use the Language Identifier 
> class from Tika and analyze each document inside.





[jira] [Updated] (NUTCH-2086) Nutch 1.X Webui

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2086:
-
Fix Version/s: (was: 1.11)
   1.12

> Nutch 1.X Webui 
> 
>
> Key: NUTCH-2086
> URL: https://issues.apache.org/jira/browse/NUTCH-2086
> Project: Nutch
>  Issue Type: New Feature
>  Components: REST_api, web gui
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2086.patch
>
>
> Port the Apache Wicket-based web UI from Nutch 2.X to 1.X.





[jira] [Updated] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2132:
-
Fix Version/s: (was: 1.11)
   1.12

> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2132.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events 
> (e.g. Fetcher events like fetch-start, fetch-end, or a fetch report which may 
> contain data like the outlinks of the currently fetched URL, its score, etc.). 
> A consumer of this functionality could use this data to generate real-time 
> visualizations and crawl statistics without having to wait for the fetch 
> round to finish. 
> The REST API could contain an endpoint which would respond with a URL to 
> which a client could subscribe to get the fetcher events. 
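
A minimal in-process sketch of the publisher/subscriber idea. FetchEvent and
FetchEventPublisher are hypothetical names, the event fields are assumptions,
and nothing here reflects the attached patch or any message broker it may use.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical event type; the field names are assumptions.
class FetchEvent {
    final String type; // e.g. "fetch-start" or "fetch-end"
    final String url;

    FetchEvent(String type, String url) {
        this.type = type;
        this.url = url;
    }
}

// Hypothetical publisher: delivers each event to every registered subscriber.
// Not thread-safe; a real fetcher would need a concurrent collection.
class FetchEventPublisher {
    private final List<Consumer<FetchEvent>> subscribers = new ArrayList<>();

    void subscribe(Consumer<FetchEvent> subscriber) {
        subscribers.add(subscriber);
    }

    void publish(FetchEvent event) {
        for (Consumer<FetchEvent> subscriber : subscribers) {
            subscriber.accept(event);
        }
    }
}
```

A subscriber receives events as they are published, which is what would let a
visualization update before the fetch round has finished.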





[jira] [Updated] (NUTCH-2140) Atomic update and optimistic concurrency update using Solr

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2140:
-
Fix Version/s: (was: 1.11)
   1.12

> Atomic update and optimistic concurrency update using Solr
> --
>
> Key: NUTCH-2140
> URL: https://issues.apache.org/jira/browse/NUTCH-2140
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 1.9
>Reporter: Roannel Fernández Hernández
> Fix For: 1.12
>
>
> The SOLRIndexWriter plugin indexes documents into a Solr server. 
> The plugin replaces documents that are already indexed in Solr. 
> Sometimes it is useful to replace only one field, or to add new fields, while 
> keeping the other values of the indexed documents.
> Solr supports two approaches for this task: atomic updates and optimistic 
> concurrency. However, the SOLRIndexWriter plugin doesn't support those 
> approaches.
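
For illustration, an atomic update in Solr replaces a field's plain value with
a modifier map such as {"set": value}, and optimistic concurrency adds a
_version_ field that must match the stored document. The sketch below only
builds such a document as a plain Map; the class AtomicUpdateSketch, the
method atomicUpdate and the unique key name "id" are assumptions, and sending
the document to Solr is left out.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not part of the SOLRIndexWriter plugin.
class AtomicUpdateSketch {
    // Builds a partial document that sets one field and leaves the others untouched.
    static Map<String, Object> atomicUpdate(String id, String field,
                                            Object newValue, long expectedVersion) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id); // unique key; "id" is assumed here
        doc.put(field, Collections.singletonMap("set", newValue)); // atomic "set" modifier
        doc.put("_version_", expectedVersion); // optimistic concurrency check
        return doc;
    }
}
```

Serialized as JSON, the result has the shape
{"id": ..., "title": {"set": ...}, "_version_": ...}; Solr rejects the update
if the stored version differs from the one supplied.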





[jira] [Updated] (NUTCH-2120) Remove MapWritable from trunk codebase

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2120:
-
Fix Version/s: (was: 1.11)
   1.12

> Remove MapWritable from trunk codebase
> --
>
> Key: NUTCH-2120
> URL: https://issues.apache.org/jira/browse/NUTCH-2120
> Project: Nutch
>  Issue Type: Bug
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.12
>
>
> [MapWritable|http://nutch.apache.org/apidocs/apidocs-1.10/index.html?org/apache/nutch/crawl/MapWritable.htm]
>  has been deprecated for a good while.
> We should remove it from the codebase and make sure we are not using it 
> anywhere (I don't think we are).





[jira] [Updated] (NUTCH-2139) Basic plugin to index inlinks and outlinks

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2139:
-
Fix Version/s: (was: 1.11)
   1.12

> Basic plugin to index inlinks and outlinks
> --
>
> Key: NUTCH-2139
> URL: https://issues.apache.org/jira/browse/NUTCH-2139
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Reporter: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: link, plugin
> Fix For: 1.12
>
>
> A basic plugin that indexes the inlinks and outlinks of web pages. 
> This could be very useful for analytic purposes, including neat 
> visualizations using d3.js, for instance. 





[jira] [Updated] (NUTCH-2122) Implement Javadoc package.html for service packages

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2122:
-
Fix Version/s: (was: 1.11)
   1.12

> Implement Javadoc package.html for service packages
> ---
>
> Key: NUTCH-2122
> URL: https://issues.apache.org/jira/browse/NUTCH-2122
> Project: Nutch
>  Issue Type: Improvement
>  Components: nutch server
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Priority: Trivial
> Fix For: 1.12
>
>
> [~sujenshah] I noticed that the Javadoc for the service packages does not 
> contain a package.html with package-level introductory Javadoc, as every 
> other package does.
> http://nutch.apache.org/apidocs/apidocs-1.10/index.html





[jira] [Updated] (NUTCH-2135) Ant Eclipse build does not include protocol-interactiveselenium

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2135?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2135:
-
Fix Version/s: (was: 1.11)
   1.12

> Ant Eclipse build does not include protocol-interactiveselenium
> ---
>
> Key: NUTCH-2135
> URL: https://issues.apache.org/jira/browse/NUTCH-2135
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Reporter: Sujen Shah
>Priority: Minor
>  Labels: memex
> Fix For: 1.12
>
>
> The eclipse target in the build.xml file does not include 
> protocol-interactiveselenium, so when the project is imported into Eclipse, 
> that folder is not added.  
> On adding it to the build file, I found that Eclipse throws errors because 
> the package naming in classes belonging to 
> org.apache.nutch.protocol.interactiveselenium.handlers is incomplete. 
> I have made both of those changes in this PR. 





[jira] [Updated] (NUTCH-2128) Refactor configuration end point

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2128:
-
Fix Version/s: (was: 1.11)
   1.12

> Refactor configuration end point
> 
>
> Key: NUTCH-2128
> URL: https://issues.apache.org/jira/browse/NUTCH-2128
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Sujen Shah
>Priority: Minor
> Fix For: 1.12
>
>
> To better define the endpoint to create a new configuration and add a new 
> endpoint to update a particular property value of a configuration. 





[jira] [Updated] (NUTCH-1943) Form authentication should not be global and ignore

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1943?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-1943:
-
Fix Version/s: (was: 1.11)
   1.12

> Form authentication should not be global and ignore 
> ---
>
> Key: NUTCH-1943
> URL: https://issues.apache.org/jira/browse/NUTCH-1943
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin, protocol
>Affects Versions: 1.10
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: memex
> Fix For: 1.12
>
>
> Taken from [~wastl-nagel]'s comments on NUTCH-827
> bq. the form authentication is global and ignores . So you have to 
> restrict your crawl to the form authentication pages only. Ideally, also form 
> authentication should be bound to a scope (one host, one URL prefix, etc.) 
> same as HTTP authentication.





[jira] [Updated] (NUTCH-2064) URLNormalizer basic to properly encode non-ASCII characters

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2064:
-
Fix Version/s: (was: 1.11)
   1.12

> URLNormalizer basic to properly encode non-ASCII characters
> ---
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch, NUTCH-2064-v3.patch, 
> NUTCH-2064.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.





[jira] [Resolved] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2141.
--
   Resolution: Fixed
Fix Version/s: 1.11

Thanks [~BalaJira] [~jo...@apache.org] plenty to  improve on but a great start!

{noformat}
[chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2141: Change 
the InteractiveSelenium plugin handler Interface to return page content 
contributed by Balaji  this closes #77 #75"
Sending        CHANGES.txt
Sending        src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java
Sending        src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java
Sending        src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultClickAllAjaxLinksHandler.java
Sending        src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java
Sending        src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java
Transmitting file data ..
Committed revision 1709307.
[chipotle:~/tmp/nutch1.11] mattmann% 
{noformat}


> Change the InteractiveSelenium plugin handler Interface to return page content
> --
>
> Key: NUTCH-2141
> URL: https://issues.apache.org/jira/browse/NUTCH-2141
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Balaji Gurumurthy
>Assignee: Chris A. Mattmann
>  Labels: selenium
> Fix For: 1.11
>
>
> The handler interface in the protocol-interactiveselenium plugin currently 
> provides methods to manipulate the page content, and the HttpResponse class 
> reads the page content from the driver. This limits the amount of HTML 
> content that can be returned to Nutch.
> The processDriver method could return a String object instead. This is 
> particularly helpful in cases such as handling pagination, where multiple 
> pages' content can be appended and returned from the handler.





[GitHub] nutch pull request: Trunk

2015-10-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/75


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962574#comment-14962574
 ] 

ASF GitHub Bot commented on NUTCH-2141:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/77


> Change the InteractiveSelenium plugin handler Interface to return page content
> --
>
> Key: NUTCH-2141
> URL: https://issues.apache.org/jira/browse/NUTCH-2141
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Balaji Gurumurthy
>Assignee: Chris A. Mattmann
>  Labels: selenium
>
> The handler interface in the protocol-interactiveselenium plugin currently 
> provides methods to manipulate the page content, and the HttpResponse class 
> reads the page content from the driver. This limits the amount of HTML 
> content that can be returned to Nutch.
> The processDriver method could return a String object instead. This is 
> particularly helpful in cases such as handling pagination, where multiple 
> pages' content can be appended and returned from the handler.





[GitHub] nutch pull request: fix for NUTCH-2141 contributed by Balaji Gurum...

2015-10-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/77




[jira] [Work started] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2141 started by Chris A. Mattmann.

> Change the InteractiveSelenium plugin handler Interface to return page content
> --
>
> Key: NUTCH-2141
> URL: https://issues.apache.org/jira/browse/NUTCH-2141
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Balaji Gurumurthy
>Assignee: Chris A. Mattmann
>  Labels: selenium
>
> The handler interface in the protocol-interactiveselenium plugin currently 
> provides methods to manipulate the page content, and the HttpResponse class 
> reads the page content from the driver. This limits the amount of HTML 
> content that can be returned to Nutch.
> The processDriver method could return a String object instead. This is 
> particularly helpful in cases such as handling pagination, where multiple 
> pages' content can be appended and returned from the handler.





[jira] [Resolved] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2129.
--
Resolution: Fixed

Thanks [~jo...@apache.org]!

{noformat}
[chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2129 - Add 
protocol status tracking to crawl datum contributed by Michael Joyce 
 this closes #68."
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/metadata/Nutch.java
Sending        src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
Sending        src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java
Transmitting file data 
Committed revision 1709306.
{noformat}


> Track Protocol Status in Crawl Datum
> 
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.





[jira] [Assigned] (NUTCH-2141) Change the InteractiveSelenium plugin handler Interface to return page content

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2141:


Assignee: Chris A. Mattmann

> Change the InteractiveSelenium plugin handler Interface to return page content
> --
>
> Key: NUTCH-2141
> URL: https://issues.apache.org/jira/browse/NUTCH-2141
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Balaji Gurumurthy
>Assignee: Chris A. Mattmann
>  Labels: selenium
>
> The handler interface in the protocol-interactiveselenium plugin currently 
> provides methods to manipulate the page content, and the HttpResponse class 
> reads the page content from the driver. This limits the amount of HTML 
> content that can be returned to Nutch.
> The processDriver method could return a String object instead. This is 
> particularly helpful in cases such as handling pagination, where multiple 
> pages' content can be appended and returned from the handler.





[jira] [Commented] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-18 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962572#comment-14962572
 ] 

ASF GitHub Bot commented on NUTCH-2129:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/68


> Track Protocol Status in Crawl Datum
> 
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.





[GitHub] nutch pull request: NUTCH-2129 - Add protocol status tracking to c...

2015-10-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/68




[jira] [Work started] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2129 started by Chris A. Mattmann.

> Track Protocol Status in Crawl Datum
> 
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3, 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
> Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.





[jira] [Resolved] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2142.
--
Resolution: Fixed

Thanks [~karanjeets]!

{noformat}
[chipotle:~/tmp/nutch1.11] mattmann% svn commit -m "Fix for NUTCH-2142: Nutch 
File Dump - FileNotFoundException (Invalid Argument) Error contributed by 
karanjeets  this closes #76."
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/tools/FileDumper.java
Sending        src/java/org/apache/nutch/util/DumpFileUtil.java
Transmitting file data ...
Committed revision 1709304.
[chipotle:~/tmp/nutch1.11] mattmann% 
{noformat}


> Nutch File Dump - FileNotFoundException (Invalid Argument) Error
> 
>
> Key: NUTCH-2142
> URL: https://issues.apache.org/jira/browse/NUTCH-2142
> Project: Nutch
> Issue Type: Bug
> Components: tool, util
> Affects Versions: 1.10, 1.11
> Environment: Operating System - Linux (RHEL 6.2)
> Reporter: Karanjeet Singh
> Assignee: Chris A. Mattmann
> Labels: dump, nutch
> Fix For: 1.11
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> Got *FileNotFoundException* while running nutch dump.
> *Cause*: Character '?' in file name/extension producing the below error.
> *Error Details*
> java.io.FileNotFoundException: /media/PATRO/Karan/nutch_12Oct/other_gun_urls/img/99/fb/97d3980f9954b597f372d092b97eff22_27tlt_recon_1_black_g_10_handle_.jpeg? (Invalid argument)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
> at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:222)
> at org.apache.nutch.tools.FileDumper.main(FileDumper.java:325)
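
For illustration only, the failure mode described in the issue (a '?' carried over from the URL into the output file name) can be avoided by replacing such characters before the file is opened. The sketch below is not the committed change to FileDumper.java or DumpFileUtil.java; the class name FileNameSanitizer, the method sanitize, the set of characters treated as unsafe, and the shortened file name in main are all assumptions made for this example.

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: makes a URL-derived name usable as a file name. */
public final class FileNameSanitizer {

    // Characters that common file systems reject or treat specially.
    // This list is an assumption; adjust it for the target file system.
    private static final Pattern UNSAFE = Pattern.compile("[\\\\/:*?\"<>|]");

    private FileNameSanitizer() {
    }

    /** Replaces every unsafe character in the given name with an underscore. */
    public static String sanitize(String name) {
        return UNSAFE.matcher(name).replaceAll("_");
    }

    public static void main(String[] args) {
        // Made-up, shortened name ending in '?', similar in shape to the one in the report.
        System.out.println(sanitize("recon_1_black_g_10_handle_.jpeg?"));
    }
}
```

The method is meant for a single file name, not a full path, since it also replaces path separators. A caller would sanitize the name first and only then pass the resulting path to FileOutputStream.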





[jira] [Assigned] (NUTCH-2129) Track Protocol Status in Crawl Datum

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2129?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2129:


Assignee: Chris A. Mattmann

> Track Protocol Status in Crawl Datum
> 
>
> Key: NUTCH-2129
> URL: https://issues.apache.org/jira/browse/NUTCH-2129
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 2.3, 1.10
> Reporter: Michael Joyce
> Assignee: Chris A. Mattmann
> Fix For: 2.4, 1.11
>
>
> It's become necessary on a few crawls that I run to get protocol status code 
> stats. After speaking with [~lewismc] it seemed that there might not be a 
> super convenient way of doing this as is, but it would be great to be able to 
> add the functionality necessary to pull this information out.





[GitHub] nutch pull request: Fixed FileNotFoundException (Invalid Argument)...

2015-10-18 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/76




[jira] [Work started] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2142 started by Chris A. Mattmann.

> Nutch File Dump - FileNotFoundException (Invalid Argument) Error
> 
>
> Key: NUTCH-2142
> URL: https://issues.apache.org/jira/browse/NUTCH-2142
> Project: Nutch
> Issue Type: Bug
> Components: tool, util
> Affects Versions: 1.10, 1.11
> Environment: Operating System - Linux (RHEL 6.2)
> Reporter: Karanjeet Singh
> Assignee: Chris A. Mattmann
> Labels: dump, nutch
> Fix For: 1.11
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> Got *FileNotFoundException* while running nutch dump.
> *Cause*: Character '?' in file name/extension producing the below error.
> *Error Details*
> java.io.FileNotFoundException: /media/PATRO/Karan/nutch_12Oct/other_gun_urls/img/99/fb/97d3980f9954b597f372d092b97eff22_27tlt_recon_1_black_g_10_handle_.jpeg? (Invalid argument)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
> at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:222)
> at org.apache.nutch.tools.FileDumper.main(FileDumper.java:325)





[jira] [Assigned] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error

2015-10-18 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2142:


Assignee: Chris A. Mattmann

> Nutch File Dump - FileNotFoundException (Invalid Argument) Error
> 
>
> Key: NUTCH-2142
> URL: https://issues.apache.org/jira/browse/NUTCH-2142
> Project: Nutch
> Issue Type: Bug
> Components: tool, util
> Affects Versions: 1.10, 1.11
> Environment: Operating System - Linux (RHEL 6.2)
> Reporter: Karanjeet Singh
> Assignee: Chris A. Mattmann
> Labels: dump, nutch
> Fix For: 1.11
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> Got *FileNotFoundException* while running nutch dump.
> *Cause*: Character '?' in file name/extension producing the below error.
> *Error Details*
> java.io.FileNotFoundException: /media/PATRO/Karan/nutch_12Oct/other_gun_urls/img/99/fb/97d3980f9954b597f372d092b97eff22_27tlt_recon_1_black_g_10_handle_.jpeg? (Invalid argument)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
> at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:222)
> at org.apache.nutch.tools.FileDumper.main(FileDumper.java:325)





[jira] [Commented] (NUTCH-2142) Nutch File Dump - FileNotFoundException (Invalid Argument) Error

2015-10-18 Thread Karanjeet Singh (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14962295#comment-14962295
 ] 

Karanjeet Singh commented on NUTCH-2142:


This has been fixed in GitHub pull request #76 
(https://github.com/apache/nutch/pull/76).

> Nutch File Dump - FileNotFoundException (Invalid Argument) Error
> 
>
> Key: NUTCH-2142
> URL: https://issues.apache.org/jira/browse/NUTCH-2142
> Project: Nutch
> Issue Type: Bug
> Components: tool, util
> Affects Versions: 1.10, 1.11
> Environment: Operating System - Linux (RHEL 6.2)
> Reporter: Karanjeet Singh
> Labels: dump, nutch
> Fix For: 1.11
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> Got *FileNotFoundException* while running nutch dump.
> *Cause*: Character '?' in file name/extension producing the below error.
> *Error Details*
> java.io.FileNotFoundException: /media/PATRO/Karan/nutch_12Oct/other_gun_urls/img/99/fb/97d3980f9954b597f372d092b97eff22_27tlt_recon_1_black_g_10_handle_.jpeg? (Invalid argument)
> at java.io.FileOutputStream.open(Native Method)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:221)
> at java.io.FileOutputStream.<init>(FileOutputStream.java:171)
> at org.apache.nutch.tools.FileDumper.dump(FileDumper.java:222)
> at org.apache.nutch.tools.FileDumper.main(FileDumper.java:325)


