Build failed in Jenkins: Nutch-trunk #3123

2015-05-15 Thread Apache Jenkins Server
See 

--
[...truncated 6434 lines...]
copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-ajax
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:
[junit] Running 
org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.015 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-basic

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-basic
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:
[junit] Running 
org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.253 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-host

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-host
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:
[junit] Running 
org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer
[junit] Tests run: 7, Failures: 0, Errors: 1, Time elapsed: 62.987 sec
[junit] Test org.apache.nutch.protocol.httpclient.TestProtocolHttpClient 
FAILED

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-pass

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-pass
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:
[junit] Running 
org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.524 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-querystring

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-querystring
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/home/jenkins/tools/ant/latest/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:
[junit] Running 
org.apache.nutch.net.urlnormalizer.querystring.TestQuerystringURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.275 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-regex

deps-test-

[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14546324#comment-14546324
 ] 

Chris A. Mattmann commented on NUTCH-2011:
--

All great points, [~wastl-nagel]. [~sujenshah], can you please review them? 
Let's work on updating this to address them, piece by piece. We'll get it done.

> Endpoint to support realtime JSON output from the fetcher
> -
>
> Key: NUTCH-2011
> URL: https://issues.apache.org/jira/browse/NUTCH-2011
> Project: Nutch
>  Issue Type: Sub-task
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
>
> This fix will create an endpoint to query the Nutch REST service and get a 
> real-time JSON response of the current/past Fetched URLs. 
> This endpoint also includes pagination of the output to reduce data transfer 
> bw in large crawls.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2011:


Sorry, but this needs some rework:
- after 35,000+ fetched pages, with the default max. heap size of 1000M, the 
fetcher becomes slow and throws mainly parser timeouts and caught OOM 
exceptions. Only small HTML pages with few outlinks per page were crawled - the 
limit is reached sooner if there are many overlong outlinks or big PDF documents.
- why an in-memory "database" of page-related information (URL, title, outlinks 
+ anchor texts)?
-- all information is available in CrawlDb, LinkDb, segments
-- MapReduce job counters provide instant progress information (e.g, number of 
fetched pages)
-- if required a queue of limited total size should be used
- in any case, this feature should be optional and off by default if 
NutchServer is not used
- "reporting" to FetchNodeDb is off if fetcher.parse is false (the default)? Is 
this intended? Construction of FetchNodes is then useless work.
- please do not write traces to System.out: "FetchNodeDb : putting node ..."
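
The "queue of limited total size" suggested in the list above could look roughly like the sketch below. It is not code from the patch: the class name BoundedFetchLog, the Entry fields and the capacity parameter are hypothetical and chosen only for illustration. The idea is that memory use stays bounded because the oldest entry is dropped once the limit is reached.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical bounded store: keeps only the most recent fetch entries. */
class BoundedFetchLog {

  /** Minimal stand-in for the per-page data (not the real FetchNode). */
  static final class Entry {
    final String url;
    final int status;

    Entry(String url, int status) {
      this.url = url;
      this.status = status;
    }
  }

  private final int capacity;
  private final Deque<Entry> entries = new ArrayDeque<>();

  BoundedFetchLog(int capacity) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
  }

  /** Adds an entry, evicting the oldest one once the capacity is reached. */
  synchronized void put(String url, int status) {
    if (entries.size() == capacity) {
      entries.removeFirst();
    }
    entries.addLast(new Entry(url, status));
  }

  /** Returns a copy so callers can iterate without holding the lock. */
  synchronized List<Entry> snapshot() {
    return new ArrayList<>(entries);
  }

  synchronized int size() {
    return entries.size();
  }
}
```

With a capacity of 2, putting three URLs leaves only the last two in the snapshot. Whether a plain size limit is enough, or whether the limit should count bytes (long outlink lists), is left open here.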



[jira] [Updated] (NUTCH-2014) Fetcher hang-up on completion

2015-05-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2014:
---
      Component/s: fetcher
       Patch Info: Patch Available
Affects Version/s: 1.11
    Fix Version/s: 1.11

> Fetcher hang-up on completion
> -
>
> Key: NUTCH-2014
> URL: https://issues.apache.org/jira/browse/NUTCH-2014
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.11
>
> Attachments: NUTCH-2014-v1.patch
>
>
> Although fetcher has done its work it does not shut down and exit but 
> continues to log (and before reports its status to the task tracker):
> {noformat}
> -activeThreads=11, spinWaiting=0, fetchQueues.totalSize=33, 
> fetchQueues.getQueueCount=1
> -activeThreads=11, spinWaiting=10, fetchQueues.totalSize=26, 
> fetchQueues.getQueueCount=1
> -activeThreads=11, spinWaiting=9, fetchQueues.totalSize=9, 
> fetchQueues.getQueueCount=1
> -activeThreads=9, spinWaiting=7, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=0
> ...
> (last message continues)
> {noformat}
> A possible hint: activeThreads should never exceed 10 (the configured 
> default). It looks like the corresponding variable was lost or mixed up 
> during the fetcher refactoring (NUTCH-1934).





[jira] [Updated] (NUTCH-2014) Fetcher hang-up on completion

2015-05-15 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2014?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2014:
---
Attachment: NUTCH-2014-v1.patch

The reason is a mix-up of the counters for active threads and fetch errors: the 
former counter was incremented to 11 just after an error according to the logs:
{noformat}
2015-05-15 21:51:37,399 INFO  fetcher.Fetcher - -activeThreads=10, ...
...
2015-05-15 21:51:38,279 INFO  fetcher.FetcherThread - fetch of ... failed with: 
...
...
2015-05-15 21:51:38,399 INFO  fetcher.Fetcher - -activeThreads=11, ...
{noformat}
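
The effect of such a mix-up can be reproduced with a small sketch. This is not the actual Fetcher code: the class FetcherCounters and all method names are hypothetical, and it only assumes that the status loop may end once the active-thread counter is back at zero.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of the two counters; not the actual Fetcher code. */
class FetcherCounters {
  private final AtomicInteger activeThreads = new AtomicInteger(0);
  private final AtomicInteger errors = new AtomicInteger(0);

  void threadStarted() { activeThreads.incrementAndGet(); }

  void threadFinished() { activeThreads.decrementAndGet(); }

  /** Intended behaviour: a failed fetch touches only the error counter. */
  void fetchFailed() { errors.incrementAndGet(); }

  /** The mix-up: a failed fetch bumps activeThreads instead of errors. */
  void fetchFailedBuggy() { activeThreads.incrementAndGet(); }

  int active() { return activeThreads.get(); }

  int errorCount() { return errors.get(); }

  /** Assumed exit condition: no thread is counted as active any more. */
  boolean mayShutDown() { return activeThreads.get() == 0; }

  public static void main(String[] args) {
    FetcherCounters ok = new FetcherCounters();
    FetcherCounters buggy = new FetcherCounters();
    for (int i = 0; i < 10; i++) {
      ok.threadStarted();
      buggy.threadStarted();
    }
    ok.fetchFailed();
    buggy.fetchFailedBuggy();   // counter is now 11 with only 10 threads
    for (int i = 0; i < 10; i++) {
      ok.threadFinished();
      buggy.threadFinished();
    }
    System.out.println("correct: active=" + ok.active() + ", shutdown=" + ok.mayShutDown());
    System.out.println("buggy:   active=" + buggy.active() + ", shutdown=" + buggy.mayShutDown());
  }
}
```

In the buggy variant the counter reaches 11 after one error and stays at 1 after all ten threads have finished, so the assumed exit condition is never met. That is consistent with the activeThreads=11 and the repeating activeThreads=1 lines in the report.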



[jira] [Created] (NUTCH-2014) Fetcher hang-up on completion

2015-05-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2014:
--

 Summary: Fetcher hang-up on completion
 Key: NUTCH-2014
 URL: https://issues.apache.org/jira/browse/NUTCH-2014
 Project: Nutch
  Issue Type: Bug
Reporter: Sebastian Nagel
Priority: Critical


Although fetcher has done its work it does not shut down and exit but continues 
to log (and before reports its status to the task tracker):
{noformat}
-activeThreads=11, spinWaiting=0, fetchQueues.totalSize=33, 
fetchQueues.getQueueCount=1
-activeThreads=11, spinWaiting=10, fetchQueues.totalSize=26, 
fetchQueues.getQueueCount=1
-activeThreads=11, spinWaiting=9, fetchQueues.totalSize=9, 
fetchQueues.getQueueCount=1
-activeThreads=9, spinWaiting=7, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=1
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=0
-activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
fetchQueues.getQueueCount=0
...
(last message continues)
{noformat}

A possible hint: activeThreads should never exceed 10 (the configured default). 
It looks like the corresponding variable was lost or mixed up during the fetcher 
refactoring (NUTCH-1934).





[jira] [Created] (NUTCH-2013) Fetcher: missing logs "fetching ..." on stdout

2015-05-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2013:
--

 Summary: Fetcher: missing logs "fetching ..." on stdout
 Key: NUTCH-2013
 URL: https://issues.apache.org/jira/browse/NUTCH-2013
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.11
Reporter: Sebastian Nagel
 Fix For: 1.11


When running Fetcher, no {{fetching ...}} messages appear on stdout; there 
are only {{-activeThreads=10, ...}} messages. This is caused by the refactoring 
of Fetcher in NUTCH-1934:
* the logging class is now FetcherThread, but it is not configured to log to 
stdout in log4j.properties
* alternatively, FetcherThread's LOG could still be obtained from Fetcher.class
* other refactored classes could be affected as well (FetchItem, etc.)
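
Why the messages disappear can be illustrated with a small stand-alone sketch. It uses java.util.logging as a stand-in for log4j (an assumption made only to keep the example self-contained; both use dot-separated logger names), and the class LoggerHierarchyDemo and its CollectingHandler are hypothetical. A handler attached to the logger named after Fetcher does not receive records from a logger named after FetcherThread, because the two names are siblings, not parent and child.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Handler;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical demo; java.util.logging stands in for log4j here. */
class LoggerHierarchyDemo {

  /** Collects messages in memory; plays the role of the stdout appender. */
  static final class CollectingHandler extends Handler {
    final List<String> messages = new ArrayList<>();

    @Override public void publish(LogRecord record) {
      messages.add(record.getMessage());
    }

    @Override public void flush() { }

    @Override public void close() { }
  }

  /** Logs once per logger and returns what the Fetcher-only handler saw. */
  static List<String> run() {
    Logger fetcher = Logger.getLogger("org.apache.nutch.fetcher.Fetcher");
    Logger thread = Logger.getLogger("org.apache.nutch.fetcher.FetcherThread");

    CollectingHandler stdout = new CollectingHandler();
    fetcher.addHandler(stdout);          // only Fetcher is "configured"

    fetcher.info("-activeThreads=10, ...");
    thread.info("fetching ...");         // sibling logger: not collected

    fetcher.removeHandler(stdout);
    return stdout.messages;
  }

  public static void main(String[] args) {
    System.out.println(run());
  }
}
```

Only the Fetcher message ends up in the collected list. Obtaining the thread's logger under the Fetcher name, or configuring the new logger name as well, would make both messages appear, which corresponds to the two options listed above.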





[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545988#comment-14545988
 ] 

Hudson commented on NUTCH-2011:
---

SUCCESS: Integrated in Nutch-trunk #3122 (See 
[https://builds.apache.org/job/Nutch-trunk/3122/])
NUTCH-2011 Endpoint to support realtime JSON output from the fetcher: 
Contributed by Sujen Shah  this closes #24. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1679613)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/fetcher/FetchNode.java
* /nutch/trunk/src/java/org/apache/nutch/fetcher/FetchNodeDb.java
* /nutch/trunk/src/java/org/apache/nutch/fetcher/FetcherThread.java
* /nutch/trunk/src/java/org/apache/nutch/service/NutchServer.java
* 
/nutch/trunk/src/java/org/apache/nutch/service/model/response/FetchNodeDbInfo.java
* /nutch/trunk/src/java/org/apache/nutch/service/resources/DbResource.java




[GitHub] nutch pull request: Branch 1.6

2015-05-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/22


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Resolved] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2011.
--
Resolution: Fixed

Committed!

Thank you [~sujenshah]

{noformat}
[mattmann-0420740:~/tmp/nutch-trunk] mattmann% svn commit -m "NUTCH-2011 
Endpoint to support realtime JSON output from the fetcher: Contributed by Sujen 
Shah  this closes #24."
Sending        CHANGES.txt
Adding         src/java/org/apache/nutch/fetcher/FetchNode.java
Adding         src/java/org/apache/nutch/fetcher/FetchNodeDb.java
Sending        src/java/org/apache/nutch/fetcher/FetcherThread.java
Sending        src/java/org/apache/nutch/service/NutchServer.java
Adding         src/java/org/apache/nutch/service/model/response/FetchNodeDbInfo.java
Sending        src/java/org/apache/nutch/service/resources/DbResource.java
Transmitting file data ...
Committed revision 1679613.
[mattmann-0420740:~/tmp/nutch-trunk] mattmann% 
{noformat}




[GitHub] nutch pull request: fix for Nutch-2011 contributed b Sujen Shah

2015-05-15 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/24




[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545890#comment-14545890
 ] 

Chris A. Mattmann commented on NUTCH-2011:
--

All tests pass!
{noformat}
copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-slash
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  
jar:file:/usr/local/Cellar/ant/1.9.4/libexec/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and 
jar:file:/Users/mattmann/tmp/nutch-trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running 
org.apache.nutch.net.urlnormalizer.slash.TestSlashURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
1.685 sec

test:

BUILD SUCCESSFUL
Total time: 8 minutes 48 seconds
[mattmann-0420740:~/tmp/nutch-trunk] mattmann% 
{noformat}

Going to commit now.




[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545771#comment-14545771
 ] 

Chris A. Mattmann commented on NUTCH-2011:
--

I put some comments on the PR, and they have been addressed. I tested it out 
and it seems to work OK. Will commit this now.



[jira] [Updated] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2011:
-
Fix Version/s: 1.11



[jira] [Assigned] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2011:


Assignee: Chris A. Mattmann



[jira] [Updated] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2011:
-
Component/s: REST_api
 fetcher



[jira] [Updated] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2011:
-
Labels: memex  (was: )



[jira] [Work started] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2011 started by Chris A. Mattmann.



[GitHub] nutch pull request: fix for Nutch-2011 contributed b Sujen Shah

2015-05-15 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/24#discussion_r30415404
  
--- Diff: src/java/org/apache/nutch/fetcher/FetchNodeDb.java ---
@@ -0,0 +1,53 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.fetcher;
+
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+
+public class FetchNodeDb {
+
+  private Map fetchNodeDbMap;
+  private int index;
+  private static FetchNodeDb fetchNodeDbInstance = null;
+  
+  public FetchNodeDb(){
+//System.out.println("Calling FetchNode constructor");
+fetchNodeDbMap = new ConcurrentHashMap();
+index = 1;
+  }
+  
+  public static FetchNodeDb getInstance(){
+
+if(fetchNodeDbInstance == null){
+//  System.out.println("Creating FetchNode instance");
--- End diff --

remove




[GitHub] nutch pull request: fix for Nutch-2011 contributed b Sujen Shah

2015-05-15 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/24#discussion_r30415409
  
--- Diff: src/java/org/apache/nutch/fetcher/FetchNodeDb.java ---
@@ -0,0 +1,53 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.fetcher;
+
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+
+public class FetchNodeDb {
+
+  private Map fetchNodeDbMap;
+  private int index;
+  private static FetchNodeDb fetchNodeDbInstance = null;
+  
+  public FetchNodeDb(){
+//System.out.println("Calling FetchNode constructor");
+fetchNodeDbMap = new ConcurrentHashMap();
+index = 1;
+  }
+  
+  public static FetchNodeDb getInstance(){
+
+if(fetchNodeDbInstance == null){
+//  System.out.println("Creating FetchNode instance");
+  fetchNodeDbInstance = new FetchNodeDb();
+}
+//System.out.println("FetchNodeDb Instance : " + fetchNodeDbInstance);
--- End diff --

remove




[GitHub] nutch pull request: fix for Nutch-2011 contributed b Sujen Shah

2015-05-15 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/24#discussion_r30415465
  
--- Diff: src/java/org/apache/nutch/service/model/response/FetchNodeDbInfo.java ---
@@ -0,0 +1,87 @@
+package org.apache.nutch.service.model.response;
--- End diff --

ALv2 header




[GitHub] nutch pull request: fix for Nutch-2011 contributed b Sujen Shah

2015-05-15 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/24#discussion_r30415390
  
--- Diff: src/java/org/apache/nutch/fetcher/FetchNodeDb.java ---
@@ -0,0 +1,53 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.fetcher;
+
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+
+public class FetchNodeDb {
+
+  private Map fetchNodeDbMap;
+  private int index;
+  private static FetchNodeDb fetchNodeDbInstance = null;
+  
+  public FetchNodeDb(){
+//System.out.println("Calling FetchNode constructor");
--- End diff --

remove




[GitHub] nutch pull request: fix for Nutch-2011 contributed b Sujen Shah

2015-05-15 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/24#discussion_r30415367
  
--- Diff: src/java/org/apache/nutch/fetcher/FetchNode.java ---
@@ -0,0 +1,62 @@
+package org.apache.nutch.fetcher;
+
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.parse.Outlink;
+
+public class FetchNode {
+  private Text url = null;
+  private Outlink[] outlinks;
+  private int status = 0;
+  private String title = null;
+  private long fetchTime = 0;
+  
+  public Text getUrl() {
+return url;
+  }
+  public void setUrl(Text url) {
+System.out.println(this.hashCode() + " Setting url to : " + url.toString());
+this.url = url;
+  }
+  public Outlink[] getOutlinks() {
+return outlinks;
+  }
+  public void setOutlinks(Outlink[] links) {
+System.out.println(this.hashCode() + " Setting outlinks to : " + links.length);
+this.outlinks = links;
+  }
+  public int getStatus() {
+return status;
+  }
+  public void setStatus(int status) {
+System.out.println(this.hashCode() + " Setting status to : " + status);
+this.status = status;
+  }
+  public String getTitle() {
+return title;
+  }
+  public void setTitle(String title) {
+System.out.println(this.hashCode() + " Setting title to : " + title);
+this.title = title;
+  }
+  public long getFetchTime() {
+return fetchTime;
+  }
+  public void setFetchTime(long fetchTime) {
+System.out.println(this.hashCode() + " Setting fetchTime to : " + fetchTime);
+this.fetchTime = fetchTime;
+  }
+  
+//  public String toString(){
--- End diff --

If it's commented out, don't include it please :)
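
As a rough sketch of what a cleaned-up version could look like: the class below drops the System.out.println calls from the setters and ships a real toString() instead of a commented-out one. It is not the code from the pull request. FetchNodeSketch is a hypothetical name, and String and String[] are stand-ins for org.apache.hadoop.io.Text and Outlink[] so that the example has no external dependencies. The null check in setOutlinks is also an assumption.

```java
// Stand-in for FetchNode: String replaces org.apache.hadoop.io.Text and
// String[] replaces Outlink[] so the sketch has no external dependencies.
class FetchNodeSketch {
  private String url;
  private String[] outlinks = new String[0];
  private int status;
  private String title;
  private long fetchTime;

  String getUrl() { return url; }
  void setUrl(String url) { this.url = url; }

  String[] getOutlinks() { return outlinks; }
  // Treats null as "no outlinks" so that toString() never fails.
  void setOutlinks(String[] links) {
    this.outlinks = (links == null) ? new String[0] : links;
  }

  int getStatus() { return status; }
  void setStatus(int status) { this.status = status; }

  String getTitle() { return title; }
  void setTitle(String title) { this.title = title; }

  long getFetchTime() { return fetchTime; }
  void setFetchTime(long fetchTime) { this.fetchTime = fetchTime; }

  // One place to render the node for debugging, instead of a print
  // statement in every setter.
  @Override
  public String toString() {
    return "FetchNode{url=" + url + ", status=" + status + ", title=" + title
        + ", fetchTime=" + fetchTime + ", outlinks=" + outlinks.length + "}";
  }
}
```

If per-call tracing is still wanted, a logger at debug level would be the usual replacement for the println calls.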




[GitHub] nutch pull request: fix for Nutch-2011 contributed by Sujen Shah

2015-05-15 Thread chrismattmann
Github user chrismattmann commented on a diff in the pull request:

https://github.com/apache/nutch/pull/24#discussion_r30415342
  
--- Diff: src/java/org/apache/nutch/fetcher/FetchNode.java ---
@@ -0,0 +1,62 @@
+package org.apache.nutch.fetcher;
--- End diff --

Can you add ALv2 headers?
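
For reference, this is the standard ASF source header (Apache License, Version 2.0) that the request refers to, shown above the package declaration from the diff. The FetchNodeDb.java diff in the same pull request already ends its header with the same two closing lines.

```java
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */
package org.apache.nutch.fetcher;
```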




[jira] [Commented] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-15 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545618#comment-14545618
 ] 

Hudson commented on NUTCH-2006:
---

SUCCESS: Integrated in Nutch-trunk #3121 (See 
[https://builds.apache.org/job/Nutch-trunk/3121/])
NUTCH-2006 IndexingFiltersChecker to take custom metadata as input (jnioche) 
(jnioche: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1679567)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java


> IndexingFiltersChecker  to take custom metadata as input
> 
>
> Key: NUTCH-2006
> URL: https://issues.apache.org/jira/browse/NUTCH-2006
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Julien Nioche
>Priority: Minor
> Fix For: 1.11
>
> Attachments: NUTCH-2006.patch
>
>
> Similar to [NUTCH-1757] but for IndexingFiltersChecker.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Sujen Shah (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545603#comment-14545603
 ] 

Sujen Shah commented on NUTCH-2011:
---

PR link - https://github.com/apache/nutch/pull/24

> Endpoint to support realtime JSON output from the fetcher
> -
>
> Key: NUTCH-2011
> URL: https://issues.apache.org/jira/browse/NUTCH-2011
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Sujen Shah
>
> This fix will create an endpoint to query the Nutch REST service and get a 
> real-time JSON response listing the currently and previously fetched URLs. 
> The endpoint also paginates its output to reduce data-transfer bandwidth 
> in large crawls.





[jira] [Resolved] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-15 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche resolved NUTCH-2006.
--
   Resolution: Fixed
Fix Version/s: 1.11

Committed revision 1679567.

Thanks Seb

> IndexingFiltersChecker  to take custom metadata as input
> 
>
> Key: NUTCH-2006
> URL: https://issues.apache.org/jira/browse/NUTCH-2006
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Julien Nioche
>Priority: Minor
> Fix For: 1.11
>
> Attachments: NUTCH-2006.patch
>
>
> Similar to [NUTCH-1757] but for IndexingFiltersChecker.





[jira] [Commented] (NUTCH-2012) Merge parsechecker and indexchecker

2015-05-15 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545534#comment-14545534
 ] 

Julien Nioche commented on NUTCH-2012:
--

+1 to merging them into a more generic tool. Most of the code in these two 
classes is the same. We could add a few options, e.g. one to suppress the 
display of the fields generated for indexing.

> Merge parsechecker and indexchecker
> ---
>
> Key: NUTCH-2012
> URL: https://issues.apache.org/jira/browse/NUTCH-2012
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.11
>
>
> ParserChecker and IndexingFiltersChecker have evolved from simple tools for 
> checking parsers and parse filters (ParserChecker) or indexing filters 
> (IndexingFiltersChecker) into powerful tools which emulate the crawling of a 
> single URL/document:
> - check robots.txt (NUTCH-2002)
> - follow redirects (NUTCH-2004)
> Keeping both tools in sync takes extra work (cf. NUTCH-1757/NUTCH-2006; also, 
> NUTCH-2002 and NUTCH-2004 are done only for parsechecker). It's time to merge 
> them:
> * either into one general debugging tool, keeping parsechecker and 
> indexchecker as aliases
> * or by centralizing common code in one utility class





[jira] [Commented] (NUTCH-2002) ParserChecker to check robots.txt

2015-05-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545532#comment-14545532
 ] 

Sebastian Nagel commented on NUTCH-2002:


One point: redirect targets should also be checked against robots.txt (after NUTCH-2004).
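
To make the point concrete, here is a minimal sketch of a redirect loop that applies the robots check to every URL in the chain, not only to the first one. It does not use any Nutch API: the class name RedirectRobotsCheck, the redirect map (a stand-in for actually fetching a URL and reading its redirect target) and the allowedByRobots predicate (a stand-in for the robots.txt rules lookup) are all hypothetical.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

class RedirectRobotsCheck {

  // Follows redirects starting at `start`, for at most `maxRedirects` hops.
  // Returns the final URL, or empty if any URL in the chain is disallowed
  // by the robots predicate or the hop limit is exceeded.
  static Optional<String> resolve(String start, Map<String, String> redirects,
      Predicate<String> allowedByRobots, int maxRedirects) {
    String url = start;
    for (int hops = 0; hops <= maxRedirects; hops++) {
      // Check the current URL first, whether it is the start or a target.
      if (!allowedByRobots.test(url)) {
        return Optional.empty();
      }
      String target = redirects.get(url);
      if (target == null) {
        return Optional.of(url); // no further redirect: this is the final URL
      }
      url = target;
    }
    return Optional.empty(); // too many redirects
  }
}
```

The check sits inside the loop on purpose: a start URL that is allowed can still redirect to a host or path whose robots.txt disallows it.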

> ParserChecker to check robots.txt
> -
>
> Key: NUTCH-2002
> URL: https://issues.apache.org/jira/browse/NUTCH-2002
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Julien Nioche
>Priority: Minor
> Attachments: NUTCH-2002.patch
>
>
> ParserChecker could check whether a given URL is allowed by the robots.txt 
> directives.  





[jira] [Commented] (NUTCH-2006) IndexingFiltersChecker to take custom metadata as input

2015-05-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545527#comment-14545527
 ] 

Sebastian Nagel commented on NUTCH-2006:


+1 to completing indexchecker. (I opened NUTCH-2012 so that, in future, features 
are not forgotten in one of the two checkers.)

> IndexingFiltersChecker  to take custom metadata as input
> 
>
> Key: NUTCH-2006
> URL: https://issues.apache.org/jira/browse/NUTCH-2006
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Julien Nioche
>Priority: Minor
> Attachments: NUTCH-2006.patch
>
>
> Similar to [NUTCH-1757] but for IndexingFiltersChecker.





[jira] [Created] (NUTCH-2012) Merge parsechecker and indexchecker

2015-05-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2012:
--

 Summary: Merge parsechecker and indexchecker
 Key: NUTCH-2012
 URL: https://issues.apache.org/jira/browse/NUTCH-2012
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.10
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.11


ParserChecker and IndexingFiltersChecker have evolved from simple tools for 
checking parsers and parse filters (ParserChecker) or indexing filters 
(IndexingFiltersChecker) into powerful tools which emulate the crawling of a 
single URL/document:
- check robots.txt (NUTCH-2002)
- follow redirects (NUTCH-2004)

Keeping both tools in sync takes extra work (cf. NUTCH-1757/NUTCH-2006; also, 
NUTCH-2002 and NUTCH-2004 are done only for parsechecker). It's time to merge them:
* either into one general debugging tool, keeping parsechecker and indexchecker 
as aliases
* or by centralizing common code in one utility class





[jira] [Commented] (NUTCH-2002) ParserChecker to check robots.txt

2015-05-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14545473#comment-14545473
 ] 

Sebastian Nagel commented on NUTCH-2002:


+1, this makes ParserChecker a more powerful debugging tool:
* it's possible to pass the agent name from the command line: {{bin/nutch parsechecker -Dhttp.agent.name=myBot ...}}
* and to disable the robots check via {{-Dhttp.robot.rules.whitelist=myhost.net}} (NUTCH-1927)

> ParserChecker to check robots.txt
> -
>
> Key: NUTCH-2002
> URL: https://issues.apache.org/jira/browse/NUTCH-2002
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.9
>Reporter: Julien Nioche
>Priority: Minor
> Attachments: NUTCH-2002.patch
>
>
> ParserChecker could check whether a given URL is allowed by the robots.txt 
> directives.  





[jira] [Created] (NUTCH-2011) Endpoint to support realtime JSON output from the fetcher

2015-05-15 Thread Sujen Shah (JIRA)
Sujen Shah created NUTCH-2011:
-

 Summary: Endpoint to support realtime JSON output from the fetcher
 Key: NUTCH-2011
 URL: https://issues.apache.org/jira/browse/NUTCH-2011
 Project: Nutch
  Issue Type: Sub-task
Reporter: Sujen Shah


This fix will create an endpoint to query the Nutch REST service and get a 
real-time JSON response listing the currently and previously fetched URLs. 

The endpoint also paginates its output to reduce data-transfer bandwidth 
in large crawls.
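
The pagination idea can be sketched as follows. This is only an illustration of slicing a result list into pages, not the endpoint from the pull request: the class name FetchedUrlPager, the page() method and its 0-based page/pageSize parameters are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class FetchedUrlPager {

  // Returns the entries of page `page` (0-based) when `all` is split into
  // pages of `pageSize` entries. A page past the end yields an empty list.
  static <T> List<T> page(List<T> all, int page, int pageSize) {
    if (page < 0 || pageSize <= 0) {
      throw new IllegalArgumentException("page must be >= 0 and pageSize > 0");
    }
    // Compute the start offset as a long to avoid int overflow.
    long from = (long) page * pageSize;
    if (from >= all.size()) {
      return new ArrayList<>();
    }
    int to = (int) Math.min(from + pageSize, all.size());
    // Copy the slice so the caller does not hold a view of the full list.
    return new ArrayList<>(all.subList((int) from, to));
  }
}
```

With this shape a client asks for one page at a time, so the size of each JSON response is bounded by pageSize however many URLs the crawl has fetched.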


