Jenkins build is back to normal : Nutch-trunk #3239

2015-08-02 Thread Apache Jenkins Server
See 



[jira] [Commented] (NUTCH-2059) protocol-httpclient, protocol-http unit test errors on Jenkins

2015-08-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651315#comment-14651315
 ] 

Chris A. Mattmann commented on NUTCH-2059:
--

We have a failed build; could it be related?
https://builds.apache.org/job/Nutch-trunk/3238/testReport/junit/org.apache.nutch.fetcher/TestFetcher/testFetch/

> protocol-httpclient, protocol-http unit test errors on Jenkins
> --
>
> Key: NUTCH-2059
> URL: https://issues.apache.org/jira/browse/NUTCH-2059
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Reporter: Peter Ciuffetti
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> This is an occasional error on the build of the Nutch trunk visible in 
> Jenkins builds.  It happens on either protocol-http or protocol-httpclient, 
> which can be running at the same time given the multi-threaded test setup.
> {code}
> [junit] Running org.apache.nutch.protocol.httpclient.TestProtocolHttpClient
> [junit] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.377 sec
> [junit] Test org.apache.nutch.protocol.http.TestProtocolHttp FAILED
> {code}
> Evidence of failures on Jenkins goes back to build #3154 (Failed, Jun 8, 2015 4:00:00 AM):
> https://builds.apache.org/view/All/job/Nutch-trunk/3154/consoleFull
> The failures are repeated at:
> https://builds.apache.org/view/All/job/Nutch-trunk/3190/console
> https://builds.apache.org/view/All/job/Nutch-trunk/3189/console
> Some possibly related tickets:
> NUTCH-1836 Timeouts in protocol-httpclient when crawling same host with >2 
> threads 
> NUTCH-1086 Rewrite protocol-httpclient
> The unit tests are not failing for me on my sandbox, but there are some 
> exceptions being output to the log related to headers being sent on JSP pages 
> after the response writer is invoked.
> {code}
> java.lang.IllegalStateException: STREAM
> at org.mortbay.jetty.Response.getWriter(Response.java:616)
> {code}
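
For readers unfamiliar with this exception: under the Servlet API, a response may be written through either its output stream or its writer, but not both, and `getWriter()` throws `IllegalStateException` once `getOutputStream()` has been used. The sketch below is a minimal model of that rule only. `ResponseModel` and its fields are hypothetical names, not Jetty's `Response` class, and it is an assumption that this is the path behind the log message quoted above.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.io.PrintWriter;

// Hypothetical stand-in for a servlet response; not Jetty's Response class.
final class ResponseModel {
  private enum Mode { NONE, STREAM, WRITER }

  private Mode mode = Mode.NONE;
  private final ByteArrayOutputStream body = new ByteArrayOutputStream();

  // Switches the response into byte-stream mode, unless a writer was already handed out.
  OutputStream getOutputStream() {
    if (mode == Mode.WRITER) {
      throw new IllegalStateException("WRITER");
    }
    mode = Mode.STREAM;
    return body;
  }

  // Switches the response into writer mode, unless the byte stream was already handed out.
  PrintWriter getWriter() {
    if (mode == Mode.STREAM) {
      throw new IllegalStateException("STREAM");
    }
    mode = Mode.WRITER;
    return new PrintWriter(body, true);
  }
}
```

In this model, calling `getOutputStream()` and then `getWriter()` on the same instance raises `IllegalStateException` with the message `STREAM`, while calling `getWriter()` repeatedly is allowed.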



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Build failed in Jenkins: Nutch-trunk #3238

2015-08-02 Thread Apache Jenkins Server
See 

Changes:

[mattmann] Fix for NUTCH-2066: Parameterize Generate REST endpoint contributed 
by Sujen Shah  this closes #47.

--
[...truncated 4272 lines...]
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-pass
[javac] Compiling 2 source files to 

[javac] Creating empty 


jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-querystring
[javac] Compiling 2 source files to 

[javac] Creating empty 


jar:
  [jar] Building jar: 


deps-test:

deploy:
 [copy] Copying 1 file to 


copy-generated-lib:
 [copy] Copying 1 file to 

[mkdir] Created dir: 

 [copy] Copying 4 files to 


init:
[mkdir] Created dir: 

[mkdir] Created dir: 

[mkdir] Created dir: 


init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-regex
[javac] Compiling 2 source files to 

[javac] Creating empty 


jar:
  [jar] Building jar: 


deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


[jira] [Commented] (NUTCH-2066) Parameterize Generate REST endpoint

2015-08-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651314#comment-14651314
 ] 

Hudson commented on NUTCH-2066:
---

FAILURE: Integrated in Nutch-trunk #3238 (See 
[https://builds.apache.org/job/Nutch-trunk/3238/])
Fix for NUTCH-2066: Parameterize Generate REST endpoint contributed by Sujen 
Shah  this closes #47. (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1693844)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java


> Parameterize Generate REST endpoint
> ---
>
> Key: NUTCH-2066
> URL: https://issues.apache.org/jira/browse/NUTCH-2066
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>
> Allow user to specify crawldb and segment db in the Generate Job REST 
> endpoint 





[jira] [Commented] (NUTCH-2066) Parameterize Generate REST endpoint

2015-08-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651300#comment-14651300
 ] 

ASF GitHub Bot commented on NUTCH-2066:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/47


> Parameterize Generate REST endpoint
> ---
>
> Key: NUTCH-2066
> URL: https://issues.apache.org/jira/browse/NUTCH-2066
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>
> Allow user to specify crawldb and segment db in the Generate Job REST 
> endpoint 





[jira] [Resolved] (NUTCH-2066) Parameterize Generate REST endpoint

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2066.
--
Resolution: Fixed

Committed to trunk:

{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m "Fix for NUTCH-2066: 
Parameterize Generate REST endpoint contributed by Sujen Shah 
 this closes #47."
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/crawl/Generator.java
Transmitting file data ..
Committed revision 1693844.
[chipotle:~/tmp/nutch-trunk] mattmann% 
{noformat}


> Parameterize Generate REST endpoint
> ---
>
> Key: NUTCH-2066
> URL: https://issues.apache.org/jira/browse/NUTCH-2066
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>
> Allow user to specify crawldb and segment db in the Generate Job REST 
> endpoint 





[GitHub] nutch pull request: Fix for NUTCH-2066 contributed by Sujen Shah

2015-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/47


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-2066) Parameterize Generate REST endpoint

2015-08-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651297#comment-14651297
 ] 

Chris A. Mattmann commented on NUTCH-2066:
--

All tests pass:

{noformat}

test:
 [echo] Testing plugin: urlnormalizer-slash
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]          jar:file:/usr/local/Cellar/ant/1.9.4/libexec/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]      and jar:file:/Users/mattmann/tmp/nutch-trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.886 sec
[junit] Running org.apache.nutch.net.urlnormalizer.slash.TestSlashURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.552 sec

BUILD SUCCESSFUL
Total time: 11 minutes 11 seconds
[chipotle:~/tmp/nutch-trunk] mattmann% 
{noformat}


> Parameterize Generate REST endpoint
> ---
>
> Key: NUTCH-2066
> URL: https://issues.apache.org/jira/browse/NUTCH-2066
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>
> Allow user to specify crawldb and segment db in the Generate Job REST 
> endpoint 





[jira] [Commented] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-08-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651294#comment-14651294
 ] 

Hudson commented on NUTCH-2072:
---

SUCCESS: Integrated in Nutch-trunk #3237 (See 
[https://builds.apache.org/job/Nutch-trunk/3237/])
Fix for NUTCH-2072: Deflate encoding support is broken when http.content.limit 
is set to -1 contributed by Tanguy Moal  this closes #48. 
(mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1693843)
* /nutch/trunk/CHANGES.txt
* /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java


> Deflate encoding support is broken when http.content.limit is set to -1
> ---
>
> Key: NUTCH-2072
> URL: https://issues.apache.org/jira/browse/NUTCH-2072
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Tanguy Moal
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is 
> not designed to accept a negative {{sizeLimit}}.
> The fix can simply mimic what is done for gzip encoding: if 
> {{getMaxContent() < 0}}, use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
> argument.
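
The fix described above can be sketched as follows. This is a minimal illustration, not the committed Nutch change: `DeflateLimitSketch`, `effectiveSizeLimit`, `inflateUpTo`, and `deflate` are hypothetical names, and the inflate loop is only an assumed stand-in for `DeflateUtils.inflateBestEffort`, written with the standard `java.util.zip` classes to show why a negative limit yields no output.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical illustration; not the actual Nutch HttpBase or DeflateUtils code.
final class DeflateLimitSketch {

  // Maps a "no limit" setting (any negative value) to Integer.MAX_VALUE.
  static int effectiveSizeLimit(int maxContent) {
    return maxContent < 0 ? Integer.MAX_VALUE : maxContent;
  }

  // Inflates zlib-wrapped data, returning at most sizeLimit bytes.
  // With a negative sizeLimit the loop never runs, so the result is empty.
  static byte[] inflateUpTo(byte[] in, int sizeLimit) {
    Inflater inflater = new Inflater();
    inflater.setInput(in);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[1024];
    int written = 0;
    try {
      while (!inflater.finished() && written < sizeLimit) {
        int n = inflater.inflate(buf);
        if (n == 0) {
          break; // input exhausted or dictionary needed; keep what we have
        }
        int keep = Math.min(n, sizeLimit - written);
        out.write(buf, 0, keep);
        written += keep;
      }
    } catch (DataFormatException e) {
      // Best effort: return whatever was decoded before the error.
    } finally {
      inflater.end();
    }
    return out.toByteArray();
  }

  // Demo helper: compresses bytes with zlib framing.
  static byte[] deflate(byte[] raw) {
    Deflater deflater = new Deflater();
    deflater.setInput(raw);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[1024];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }
}
```

A caller would pass `effectiveSizeLimit(getMaxContent())` rather than the raw configuration value, so that a setting of -1 means "inflate everything" instead of "inflate nothing".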





[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-08-02 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651295#comment-14651295
 ] 

Hudson commented on NUTCH-2062:
---

SUCCESS: Integrated in Nutch-trunk #3237 (See 
[https://builds.apache.org/job/Nutch-trunk/3237/])
Changes for NUTCH-2062 (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1693838)
* /nutch/trunk/CHANGES.txt
Fix for NUTCH-2062: Add Plugin for interacting with Selenium WebDriver 
contributed by Michael Joyce  this closes #46 (mattmann: 
http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1693837)
* /nutch/trunk/build.xml
* /nutch/trunk/conf/nutch-default.xml
* /nutch/trunk/src/plugin/build.xml
* /nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
* /nutch/trunk/src/plugin/protocol-interactiveselenium
* /nutch/trunk/src/plugin/protocol-interactiveselenium/README.md
* /nutch/trunk/src/plugin/protocol-interactiveselenium/build-ivy.xml
* /nutch/trunk/src/plugin/protocol-interactiveselenium/build.xml
* /nutch/trunk/src/plugin/protocol-interactiveselenium/ivy.xml
* /nutch/trunk/src/plugin/protocol-interactiveselenium/plugin.xml
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/Http.java
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java
* /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/package.html


> Add Plugin for interacting with Selenium WebDriver
> --
>
> Key: NUTCH-2062
> URL: https://issues.apache.org/jira/browse/NUTCH-2062
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2062v2.patch
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically 
> load content. However, I've run into use cases where I need to actively 
> interact with a page in Selenium before it becomes useful. For instance, I 
> may need to paginate through a table to get all results that I'm interested 
> in. This plugin will handle that use case.





[jira] [Updated] (NUTCH-2066) Parameterize Generate REST endpoint

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2066:
-
Description: Allow user to specify crawldb and segment db in the Generate 
Job REST endpoint 

> Parameterize Generate REST endpoint
> ---
>
> Key: NUTCH-2066
> URL: https://issues.apache.org/jira/browse/NUTCH-2066
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>
> Allow user to specify crawldb and segment db in the Generate Job REST 
> endpoint 





[jira] [Work started] (NUTCH-2066) Allow user to specify crawldb and segment db in the Generate JOb REST endpoint

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2066 started by Chris A. Mattmann.

> Allow user to specify crawldb and segment db in the Generate JOb REST 
> endpoint 
> ---
>
> Key: NUTCH-2066
> URL: https://issues.apache.org/jira/browse/NUTCH-2066
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>






[jira] [Updated] (NUTCH-2066) Allow user to specify crawldb and segment db in the Generate Job REST endpoint

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2066:
-
Labels: memex  (was: )

> Allow user to specify crawldb and segment db in the Generate Job REST 
> endpoint 
> ---
>
> Key: NUTCH-2066
> URL: https://issues.apache.org/jira/browse/NUTCH-2066
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>






[jira] [Updated] (NUTCH-2066) Parameterize Generate REST endpoint

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2066:
-
Summary: Parameterize Generate REST endpoint  (was: Allow user to specify 
crawldb and segment db in the Generate Job REST endpoint )

> Parameterize Generate REST endpoint
> ---
>
> Key: NUTCH-2066
> URL: https://issues.apache.org/jira/browse/NUTCH-2066
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>






[jira] [Updated] (NUTCH-2066) Allow user to specify crawldb and segment db in the Generate Job REST endpoint

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2066:
-
Summary: Allow user to specify crawldb and segment db in the Generate Job 
REST endpoint   (was: Allow user to specify crawldb and segment db in the 
Generate JOb REST endpoint )

> Allow user to specify crawldb and segment db in the Generate Job REST 
> endpoint 
> ---
>
> Key: NUTCH-2066
> URL: https://issues.apache.org/jira/browse/NUTCH-2066
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>






[jira] [Assigned] (NUTCH-2066) Allow user to specify crawldb and segment db in the Generate JOb REST endpoint

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2066:


Assignee: Chris A. Mattmann

> Allow user to specify crawldb and segment db in the Generate JOb REST 
> endpoint 
> ---
>
> Key: NUTCH-2066
> URL: https://issues.apache.org/jira/browse/NUTCH-2066
> Project: Nutch
>  Issue Type: Sub-task
>  Components: REST_api
>Reporter: Sujen Shah
>Assignee: Chris A. Mattmann
>Priority: Minor
>  Labels: memex
> Fix For: 1.11
>
>






[jira] [Comment Edited] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-08-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651279#comment-14651279
 ] 

Chris A. Mattmann edited comment on NUTCH-2072 at 8/2/15 11:39 PM:
---

Fixed, thanks [~tanguy]!

{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m "Fix for NUTCH-2072: 
Deflate encoding support is broken when http.content.limit is set to -1 
contributed by Tanguy Moal  this closes #48."
Sending        CHANGES.txt
Sending        src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
Transmitting file data ..
Committed revision 1693843.
[chipotle:~/tmp/nutch-trunk] mattmann% 
{noformat}



was (Author: chrismattmann):
Fixed, thanks [~ltanguy]

{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m "Fix for NUTCH-2072: 
Deflate encoding support is broken when http.content.limit is set to -1 
contributed by Tanguy Moal  this closes #48."
Sending        CHANGES.txt
Sending        src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
Transmitting file data ..
Committed revision 1693843.
[chipotle:~/tmp/nutch-trunk] mattmann% 
{noformat}


> Deflate encoding support is broken when http.content.limit is set to -1
> ---
>
> Key: NUTCH-2072
> URL: https://issues.apache.org/jira/browse/NUTCH-2072
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Tanguy Moal
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is 
> not designed to accept a negative {{sizeLimit}}.
> The fix can simply mimic what is done for gzip encoding: if 
> {{getMaxContent() < 0}}, use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
> argument.





[jira] [Resolved] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2062.
--
Resolution: Fixed

Committed, thanks Mike!

> Add Plugin for interacting with Selenium WebDriver
> --
>
> Key: NUTCH-2062
> URL: https://issues.apache.org/jira/browse/NUTCH-2062
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2062v2.patch
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically 
> load content. However, I've run into use cases where I need to actively 
> interact with a page in Selenium before it becomes useful. For instance, I 
> may need to paginate through a table to get all results that I'm interested 
> in. This plugin will handle that use case.





[jira] [Resolved] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann resolved NUTCH-2072.
--
Resolution: Fixed

Fixed, thanks [~ltanguy]

{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m "Fix for NUTCH-2072: 
Deflate encoding support is broken when http.content.limit is set to -1 
contributed by Tanguy Moal  this closes #48."
Sending        CHANGES.txt
Sending        src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
Transmitting file data ..
Committed revision 1693843.
[chipotle:~/tmp/nutch-trunk] mattmann% 
{noformat}


> Deflate encoding support is broken when http.content.limit is set to -1
> ---
>
> Key: NUTCH-2072
> URL: https://issues.apache.org/jira/browse/NUTCH-2072
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Tanguy Moal
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is 
> not designed to accept a negative {{sizeLimit}}.
> The fix can simply mimic what is done for gzip encoding: if 
> {{getMaxContent() < 0}}, use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
> argument.





[jira] [Commented] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-08-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651278#comment-14651278
 ] 

ASF GitHub Bot commented on NUTCH-2072:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/48


> Deflate encoding support is broken when http.content.limit is set to -1
> ---
>
> Key: NUTCH-2072
> URL: https://issues.apache.org/jira/browse/NUTCH-2072
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Tanguy Moal
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is 
> not designed to accept a negative {{sizeLimit}}.
> The fix can simply mimic what is done for gzip encoding: if 
> {{getMaxContent() < 0}}, use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
> argument.





[GitHub] nutch pull request: Fix for NUTCH-2072

2015-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/48




[jira] [Commented] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-08-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651277#comment-14651277
 ] 

Chris A. Mattmann commented on NUTCH-2072:
--

Tests pass:

{noformat}

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-slash
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]          jar:file:/usr/local/Cellar/ant/1.9.4/libexec/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]      and jar:file:/Users/mattmann/tmp/nutch-trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.net.urlnormalizer.slash.TestSlashURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.055 sec
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.856 sec
[junit] Running org.apache.nutch.tika.TestRTFParser
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.125 sec
[junit] Running org.apache.nutch.tika.TestRobotsMetaProcessor
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.994 sec

BUILD SUCCESSFUL
Total time: 13 minutes 21 seconds
{noformat}

Committing this now. Thanks.


> Deflate encoding support is broken when http.content.limit is set to -1
> ---
>
> Key: NUTCH-2072
> URL: https://issues.apache.org/jira/browse/NUTCH-2072
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Tanguy Moal
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is 
> not designed to accept a negative {{sizeLimit}}.
> The fix can simply mimic what is done for gzip encoding: if 
> {{getMaxContent() < 0}}, use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
> argument.





[jira] [Work started] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-2072 started by Chris A. Mattmann.

> Deflate encoding support is broken when http.content.limit is set to -1
> ---
>
> Key: NUTCH-2072
> URL: https://issues.apache.org/jira/browse/NUTCH-2072
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Tanguy Moal
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is 
> not designed to accept a negative {{sizeLimit}}.
> The fix can simply mimic what is done for gzip encoding: if 
> {{getMaxContent() < 0}}, use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
> argument.





[jira] [Updated] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann updated NUTCH-2072:
-
Fix Version/s: 1.11

> Deflate encoding support is broken when http.content.limit is set to -1
> ---
>
> Key: NUTCH-2072
> URL: https://issues.apache.org/jira/browse/NUTCH-2072
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Tanguy Moal
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.11
>
>
> The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is 
> not designed to accept a negative {{sizeLimit}}.
> The fix can simply mimic what is done for gzip encoding: if 
> {{getMaxContent() < 0}}, use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
> argument.





[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-08-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651266#comment-14651266
 ] 

Chris A. Mattmann commented on NUTCH-2062:
--

Thanks [~mjoyce]! All committed:

{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m "Fix for NUTCH-2062: Add 
Plugin for interacting with Selenium WebDriver contributed by Michael Joyce 
 this closes #46"
Sending        build.xml
Sending        conf/nutch-default.xml
Sending        src/plugin/build.xml
Sending        src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
Adding         src/plugin/protocol-interactiveselenium
Adding         src/plugin/protocol-interactiveselenium/README.md
Adding         src/plugin/protocol-interactiveselenium/build-ivy.xml
Adding         src/plugin/protocol-interactiveselenium/build.xml
Adding         src/plugin/protocol-interactiveselenium/ivy.xml
Adding         src/plugin/protocol-interactiveselenium/plugin.xml
Adding         src/plugin/protocol-interactiveselenium/src
Adding         src/plugin/protocol-interactiveselenium/src/java
Adding         src/plugin/protocol-interactiveselenium/src/java/org
Adding         src/plugin/protocol-interactiveselenium/src/java/org/apache
Adding         src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch
Adding         src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol
Adding         src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium
Adding         src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/Http.java
Adding         src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java
Adding         src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers
Adding         src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java
Adding         src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java
Adding         src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/package.html
Transmitting file data ..
Committed revision 1693837.
[chipotle:~/tmp/nutch-trunk] mattmann% 
{noformat}


> Add Plugin for interacting with Selenium WebDriver
> --
>
> Key: NUTCH-2062
> URL: https://issues.apache.org/jira/browse/NUTCH-2062
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Michael Joyce
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2062v2.patch
>
>
> The protocol-selenium plugin is great for pulling webpages that dynamically 
> load content. However, I've run into use cases where I need to actively 
> interact with a page in Selenium before it becomes useful. For instance, I 
> may need to paginate through a table to get all results that I'm interested 
> in. This plugin will handle that use case.





[jira] [Assigned] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1

2015-08-02 Thread Chris A. Mattmann (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris A. Mattmann reassigned NUTCH-2072:


Assignee: Chris A. Mattmann

> Deflate encoding support is broken when http.content.limit is set to -1
> ---
>
> Key: NUTCH-2072
> URL: https://issues.apache.org/jira/browse/NUTCH-2072
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Tanguy Moal
>Assignee: Chris A. Mattmann
>Priority: Minor
>
> The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is 
> not designed to have sizeLimit set to a negative value.
> The fix can be simply to mimic what's done with gzip encoding : if 
> {{getMaxContent() < 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} 
> argument.
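
The fix proposed above can be sketched roughly as follows. This is a minimal illustration, not the actual Nutch code: the class DeflateLimitSketch and its methods are hypothetical stand-ins for DeflateUtils, and the zlib-wrapped Inflater and the deflate() demo helper are assumptions made only to keep the example self-contained. The idea is simply to map a negative limit to Integer.MAX_VALUE before the inflate loop uses it.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical sketch; not the real org.apache.nutch DeflateUtils. */
final class DeflateLimitSketch {

  /** Treats a negative limit (e.g. http.content.limit = -1) as "no limit". */
  static int effectiveLimit(int maxContent) {
    return maxContent < 0 ? Integer.MAX_VALUE : maxContent;
  }

  /** Inflates as much as possible, stopping at the limit or at corrupt/truncated input. */
  static byte[] inflateBestEffort(byte[] in, int sizeLimit) {
    int limit = effectiveLimit(sizeLimit);
    Inflater inflater = new Inflater(); // assumption: zlib-wrapped deflate data
    inflater.setInput(in);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    try {
      while (!inflater.finished() && out.size() < limit) {
        int want = Math.min(buf.length, limit - out.size());
        int n = inflater.inflate(buf, 0, want);
        if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
          break; // truncated input: keep what was decoded so far
        }
        out.write(buf, 0, n);
      }
    } catch (DataFormatException e) {
      // best effort: return whatever was decoded before the corrupt data
    } finally {
      inflater.end();
    }
    return out.toByteArray();
  }

  /** Demo helper (not part of the proposed fix): compresses data for testing. */
  static byte[] deflate(byte[] data) {
    Deflater deflater = new Deflater();
    deflater.setInput(data);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }
}
```

With this shape, a caller passing -1 gets the full content back, while a non-negative limit still truncates the output at that many bytes.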





[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-08-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651265#comment-14651265
 ] 

ASF GitHub Bot commented on NUTCH-2062:
---

Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/46







[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver

2015-08-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651264#comment-14651264
 ] 

Chris A. Mattmann commented on NUTCH-2062:
--

{noformat}
test:
 [echo] Testing plugin: urlnormalizer-slash
[junit] WARNING: multiple versions of ant detected in path for junit 
[junit]  jar:file:/usr/local/Cellar/ant/1.9.4/libexec/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit]  and jar:file:/Users/mattmann/tmp/nutch-trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer
[junit] Running org.apache.nutch.net.urlnormalizer.slash.TestSlashURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.79 sec
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.929 sec

BUILD SUCCESSFUL
Total time: 12 minutes 11 seconds
[chipotle:~/tmp/nutch-trunk] mattmann% 
{noformat}

All tests passing, committing this now.







[GitHub] nutch pull request: NUTCH-2062 - Interactive Selenium Plugin

2015-08-02 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/nutch/pull/46


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[jira] [Commented] (NUTCH-2059) protocol-httpclient, protocol-http unit test errors on Jenkins

2015-08-02 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651127#comment-14651127
 ] 

Chris A. Mattmann commented on NUTCH-2059:
--

Ping: any thoughts here? There doesn't seem to have been a broken build in a
while, but maybe we should push your updates regardless, Peter?

> protocol-httpclient, protocol-http unit test errors on Jenkins
> --
>
> Key: NUTCH-2059
> URL: https://issues.apache.org/jira/browse/NUTCH-2059
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Reporter: Peter Ciuffetti
>Assignee: Chris A. Mattmann
> Fix For: 1.11
>
>
> This is an occasional error on the build of the Nutch trunk visible in 
> Jenkins builds.  It happens on either protocol-http or protocol-httpclient, 
> which can be running at the same time given the multi-threaded test setup.
> {code}
> [junit] Running org.apache.nutch.protocol.httpclient.TestProtocolHttpClient
> [junit] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.377 
> sec
> [junit] Test org.apache.nutch.protocol.http.TestProtocolHttp FAILED
> {code}
> Evidence of failure on Jenkins go back to
> Failed > Console Output  #3154Jun 8, 2015 4:00:00 AM
> https://builds.apache.org/view/All/job/Nutch-trunk/3154/consoleFull
> And are repeated at...
> https://builds.apache.org/view/All/job/Nutch-trunk/3190/console
> https://builds.apache.org/view/All/job/Nutch-trunk/3189/console
> Some possibly related tickets
> NUTCH-1836 Timeouts in protocol-httpclient when crawling same host with >2 
> threads 
> NUTCH-1086 Rewrite protocol-httpclient
> The unit tests are not failing for me on my sandbox, but there are some 
> exceptions being output to the log related to headers being sent on JSP pages 
> after the response writer is invoked.
> {code}
> java.lang.IllegalStateException: STREAM
> at org.mortbay.jetty.Response.getWriter(Response.java:616)
> {code}





Re: GSOC2015- Sitemap crawler roudmap problems

2015-08-02 Thread Cihad Guzel
Hi

I am proceeding with my work. My code is integrated into the Nutch life cycle.
Sitemap files can be injected and parsed. As you know, a sitemap file has tags
such as lastmod, priority and changefreq. First, I put the tag values into the
metadata. Then I update the last-modified and fetch-interval fields of the
web page according to the tags. But I have not used the priority tag yet. I
want to calculate a new score using the priority for the list of URLs from the
sitemap. While the URLs from the sitemap have a priority value, other web page
URLs do not have one. That is inconsistent. How do you think it should be
implemented?
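
One possible way to handle the mismatch described above, sketched under stated assumptions rather than taken from the attached patch: the sitemaps.org protocol defines a default priority of 0.5 for URLs that do not declare one, so pages without a priority could be treated as 0.5 and the score scaled relative to that default. The class SitemapHintsSketch, its method names, and the scaling formula are all hypothetical; the handling of "always" and "never" (falling back to the default interval) is also an assumption.

```java
import java.util.Locale;

/** Hypothetical sketch; names and formula are illustrative, not from the patch. */
final class SitemapHintsSketch {

  /** Default priority defined by the sitemaps.org protocol. */
  static final float DEFAULT_PRIORITY = 0.5f;

  /** Maps a changefreq value to a fetch interval in seconds. */
  static int changeFreqToSeconds(String changefreq, int defaultInterval) {
    if (changefreq == null) {
      return defaultInterval;
    }
    switch (changefreq.trim().toLowerCase(Locale.ROOT)) {
      case "hourly":  return 60 * 60;
      case "daily":   return 24 * 60 * 60;
      case "weekly":  return 7 * 24 * 60 * 60;
      case "monthly": return 30 * 24 * 60 * 60;
      case "yearly":  return 365 * 24 * 60 * 60;
      default:
        // "always", "never" and unknown values: keep the default (assumption)
        return defaultInterval;
    }
  }

  /** Parses a priority tag, clamping to [0, 1]; missing or invalid values become 0.5. */
  static float priorityOrDefault(String priority) {
    if (priority == null) {
      return DEFAULT_PRIORITY;
    }
    try {
      float p = Float.parseFloat(priority.trim());
      if (Float.isNaN(p)) {
        return DEFAULT_PRIORITY;
      }
      return Math.max(0.0f, Math.min(1.0f, p));
    } catch (NumberFormatException e) {
      return DEFAULT_PRIORITY;
    }
  }

  /** Scales a score relative to the default priority, so pages without a priority keep their score. */
  static float adjustScore(float baseScore, String priority) {
    return baseScore * (priorityOrDefault(priority) / DEFAULT_PRIORITY);
  }
}
```

Under this scheme a URL that is not listed in any sitemap keeps its base score, a URL with priority 1.0 has its score doubled, and a URL with priority 0.0 drops to zero. Whether that scaling is appropriate would depend on the scoring filter in use.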

I have attached the latest code as a patch to this email.


2015-07-11 12:10 GMT+03:00 Cihad Guzel :

> Hi Lewis.
>
> Thanks for your suggestions. I will be thinking about this.
>
> 2015-07-10 3:47 GMT+03:00 Lewis John Mcgibbney 
> :
>
>> Hi Cihad,
>> I'll take a look tonight.
>> My understanding is that this would be implemented as part of core and
>> not as a plugin. Within the plugin we can, at times, have access to less
>> verbose data structures. This is of course not always the case, but
>> generally speaking we see more issues, depending on which interfaces we
>> extend, with appropriate access to the correct data structures. We then
>> have the issue of dependency management.
>> I'll have a look through the various links you have sent and then write
>> back here in due course.
>> Apologies about the delay.
>> Thanks
>>
>> On Mon, Jul 6, 2015 at 12:20 AM, Cihad Guzel  wrote:
>>
>>> Hi,
>>>
>>> I have found a patch for my metadata problem [1]. But the problem isn't
>>> solved for 2.x [2]. I guess I need to solve it.
>>>
>>> [1] https://issues.apache.org/jira/browse/NUTCH-1622
>>> [2] https://issues.apache.org/jira/browse/NUTCH-1816
>>>
>>> 2015-07-04 15:56 GMT+03:00 Cihad Guzel :
>>>
>>>> Hi Lewis,
>>>>
>>>> Talat and I talked about the architecture for sitemap support. We thought
>>>> the problem could be solved within the Nutch life cycle. We don't want to
>>>> build a different life cycle for sitemap crawling.
>>>>
>>>> So, I have some problems, as follows:
>>>>
>>>> If the sitemap file is too large, it cannot be fetched and parsed. It
>>>> times out. I worked around the timeout problem temporarily: for parsing,
>>>> by raising the timeout value in nutch-site.xml, and for fetching, by
>>>> working with small files. That is not a good solution.
>>>>
>>>> Moreover, as you know, sitemap files have some special tags such as "loc",
>>>> "lastmod", "changefreq" and "priority". These are parsed by my parse
>>>> plugin. I want to record them to the crawldb, but the Parse object doesn't
>>>> support metadata or similar fields. It has only an outlink array. That
>>>> isn't enough for recording metadata.
>>>>
>>>> I want to record each URL in the sitemap file with its metadata separately.
>>>>
>>>> I looked at all the patches and comments on NUTCH-1465, and there are some
>>>> solutions for the same problems in it. But a new job for sitemap crawling
>>>> was created there.
>>>>
>>>> Could you show me a way out?
>>>>
>>>> Thanks.
>>>>
>>>
>>>
>>
>>
>> --
>> *Lewis*
>>
>
>
diff --git a/conf/gora-hbase-mapping.xml b/conf/gora-hbase-mapping.xml
index eb58819..5bd011b 100644
--- a/conf/gora-hbase-mapping.xml
+++ b/conf/gora-hbase-mapping.xml
@@ -46,6 +46,7 @@ http://gora.apache.org/current/gora-hbase.html
 
 
 
+
 
 
 
@@ -66,6 +67,8 @@ http://gora.apache.org/current/gora-hbase.html
 
 
 
+ 	
+ 
 
 
 
@@ -76,6 +79,8 @@ http://gora.apache.org/current/gora-hbase.html
 
 
 
+
+
 
 
 
diff --git a/conf/parse-plugins.xml b/conf/parse-plugins.xml
index 5b20be6..0551381 100644
--- a/conf/parse-plugins.xml
+++ b/conf/parse-plugins.xml
@@ -68,6 +68,7 @@
 		
 	
 
+

 
 	
diff --git a/src/gora/webpage.avsc b/src/gora/webpage.avsc
index dce0050..0761c08 100644
--- a/src/gora/webpage.avsc
+++ b/src/gora/webpage.avsc
@@ -278,6 +278,26 @@
   ],
   "doc": "A batchId that this WebPage is assigned to. WebPage's are fetched in batches, called fetchlists. Pages are partitioned but can always be associated and fetched alongside pages of similar value (within a crawl cycle) based on batchId.",
   "default": null
+},
+{
+  "name": "sitemaps",
+  "type": {
+"type": "map",
+"values": [
+  "null",
+  "string"
+]
+  },
+  "doc": "Sitemap urls in robot.txt",
+  "default": {
+
+  },
+  {
+ "name": "stmPriority",
+ "type": "float",
+ "doc": "",
+ "default": 0
+   },
 }
   ]
 }
diff --git a/src/java/org/apache/nutch/crawl/DbUpdateMapper.java b/src/java/org/apache/nutch/crawl/DbUpdateMapper.java
index bb2457f..0c9a36c 100644
--- a/src/java/org/apache/nutch/crawl/DbUpdateMapper.java
+++ b/src/java/org/apache/nutch/crawl/DbUpdateMapper.java
@@ -78,6 +78,18 @@ public class DbUpdateMapper extends
   }
 }
 
+Map sitemaps= page.getSi