[jira] [Commented] (NUTCH-2027) seed list REST endpoint for Nutch 1.10

2015-06-04 Thread Chris A. Mattmann (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573984#comment-14573984
 ] 

Chris A. Mattmann commented on NUTCH-2027:
--

Committed, thanks Asitang!

{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m "Fix for NUTCH-2027 seed 
list REST endpoint for Nutch 1.10 contributed by Asitang Mishra 
asit...@gmail.com this closes #28."
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/service/NutchServer.java
Adding src/java/org/apache/nutch/service/model/request/SeedList.java
Adding src/java/org/apache/nutch/service/resources/SeedResource.java
Transmitting file data 
Committed revision 1683657.
[chipotle:~/tmp/nutch-trunk] mattmann% 
{noformat}


 seed list REST endpoint for Nutch 1.10
 --

 Key: NUTCH-2027
 URL: https://issues.apache.org/jira/browse/NUTCH-2027
 Project: Nutch
  Issue Type: New Feature
  Components: REST_api
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
  Labels: memex, rest_api
 Fix For: 1.11


 An endpoint for Nutch 1.10 that lets the user set the seed list for the 
 REST API with a REST call.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2027) seed list REST endpoint for Nutch 1.10

2015-06-04 Thread Asitang Mishra (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573895#comment-14573895
 ] 

Asitang Mishra commented on NUTCH-2027:
---

Done






[jira] [Commented] (NUTCH-2032) Plugin to index the raw content of a readable document.

2015-06-04 Thread Luis Lopez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573148#comment-14573148
 ] 

Luis Lopez commented on NUTCH-2032:
---

Hi [~wastl-nagel], could you elaborate on what seems favourable? Yes, this will 
increase the size of the segments, which is non-trivial. I think that this 
plugin approach is less intrusive with respect to the current class 
signatures. It works well for our use case, in which we don't need the 
segments once they are indexed.

 Plugin to index the raw content of a readable document. 
 

 Key: NUTCH-2032
 URL: https://issues.apache.org/jira/browse/NUTCH-2032
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, parser
Affects Versions: 1.10
Reporter: Luis Lopez
  Labels: content, index, index-rawcontent, parser, raw
 Fix For: 1.11


 This is related to https://issues.apache.org/jira/browse/NUTCH-1785 and 
 https://issues.apache.org/jira/browse/NUTCH-1458
 We created a couple of plugins to index the raw content of readable documents. 
 If we include these plugins in the plugin chain, we'll index the raw content 
 of a readable document, e.g. XML, HTML, CSV, TXT, etc. The index-rawcontent 
 plugin is not designed to index binary files. However, having the full content 
 of an HTML/XML or CSV document is critical for some of us.





[jira] [Commented] (NUTCH-2034) CrawlDB filtered documents counter.

2015-06-04 Thread Luis Lopez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573114#comment-14573114
 ] 

Luis Lopez commented on NUTCH-2034:
---

Yes, we can use a general counter and report that, or we could be even more 
specific and count by filter.
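
To make the two options concrete, here is a minimal sketch that keeps both a general counter and a per-filter counter. It is written in Python for brevity and is not Nutch code: the filter names, the patterns, and the `filter_urls` helper are all hypothetical, and a real implementation would live in the Java CrawlDb job.

```python
import re
from collections import Counter

# Hypothetical filters for illustration only; not taken from any Nutch config.
FILTERS = {
    "image-suffix": re.compile(r"(?i)\.(gif|jpg|png)$"),
    "query-string": re.compile(r"[?*!@=]"),
}


def filter_urls(urls):
    """Return the URLs that pass every filter, plus rejection counters.

    The counters hold a general "total" entry and one entry per filter,
    credited to the first filter that rejects the URL.
    """
    counters = Counter()
    kept = []
    for url in urls:
        rejecting = next(
            (name for name, rule in FILTERS.items() if rule.search(url)), None
        )
        if rejecting is None:
            kept.append(url)
        else:
            counters["total"] += 1
            counters[rejecting] += 1
    return kept, counters


if __name__ == "__main__":
    kept, counters = filter_urls([
        "http://example.com/a.PNG",
        "http://example.com/?q=1",
        "http://example.com/page.html",
    ])
    print("Total number of URLs filtered:", counters["total"])
    for name in FILTERS:
        print("  rejected by", name + ":", counters[name])
```

The general counter costs nothing extra once the per-filter counts exist, so both can be reported together.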

 CrawlDB filtered documents counter.
 ---

 Key: NUTCH-2034
 URL: https://issues.apache.org/jira/browse/NUTCH-2034
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 1.10
Reporter: Luis Lopez
Priority: Minor
  Labels: counters, crawldb, filter, info, regex
 Fix For: 1.11


 When we are doing big crawls, we would like to know how many of the URLs are 
 being discarded by the regex filters. Currently this is only reported by the 
 Injector class:
 Injector: Total number of urls rejected by filters: 0
 It would be nice to have a counter in the CrawlDB class so we know in every 
 round how many were discarded by our filters:
 CrawlDb update: Total number of URLs filtered by regex filters: 31415





[jira] [Commented] (NUTCH-2035) Regex filter using case sensitive rules.

2015-06-04 Thread Luis Lopez (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573108#comment-14573108
 ] 

Luis Lopez commented on NUTCH-2035:
---

Hi [~wastl-nagel], after doing several tests with 100k URLs on my laptop (Core 
i7 4650U, 8GB RAM, OpenJDK 1.7.0_79) I can confirm that you are right: the 
performance improvement is negligible. On average I got 19.91s vs 19.54s with 
url.toLowerCase(). However, I got a better improvement by shortening the rules 
and writing them to be case insensitive with (?i): 18.42s. 

We are using different regex rules. I'm attaching one of our files. 
In any case, I think that the rules should be simplified in the default Nutch 
regex file. Also, in the tests we can use lower-case or case-insensitive rules 
instead of, e.g., MY.DOMAIN.NAME. 
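
As a rough illustration of the rule rewrite discussed above, the sketch below compares a rule that enumerates case variants with a shorter rule using the inline (?i) flag. It uses Python's `re` module rather than Java's `java.util.regex`; both accept a leading (?i). The two patterns and the `is_rejected` helper are made up for this example and are not taken from the attached regex-urlfilter.txt.

```python
import re

# Hypothetical suffix rules, written only for this comparison.
CASE_PAIRS = re.compile(r"\.(EXE|exe|JPG|jpg)$")
CASE_INSENSITIVE = re.compile(r"(?i)\.(exe|jpg)$")


def is_rejected(url, rule):
    """Return True when the rule matches the URL, i.e. the URL is dropped."""
    return rule.search(url) is not None


if __name__ == "__main__":
    for url in (
        "http://example.com/setup.EXE",
        "http://example.com/photo.Jpg",
        "http://example.com/index.html",
    ):
        print(url, is_rejected(url, CASE_PAIRS), is_rejected(url, CASE_INSENSITIVE))
```

Besides being shorter, the case-insensitive rule also covers mixed-case suffixes such as `.Jpg`, which the enumerated form misses.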

 Regex filter using case sensitive rules.
 

 Key: NUTCH-2035
 URL: https://issues.apache.org/jira/browse/NUTCH-2035
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Affects Versions: 1.10
Reporter: Luis Lopez
Priority: Minor
  Labels: filters, regex, regex-urlfilter
 Fix For: 1.11

 Attachments: regex-urlfilter.txt


 Regular expressions are computationally expensive, and having “EXE|exe|JPG|jpg” 
 etc. adds up if we use complex rules.
 The regex filter should use case-insensitive rules to make the rules more 
 readable and improve performance.





[jira] [Updated] (NUTCH-2035) Regex filter using case sensitive rules.

2015-06-04 Thread Luis Lopez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Lopez updated NUTCH-2035:
--
Attachment: regex-urlfilter.txt






[jira] [Created] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-04 Thread Jorge Luis Betancourt Gonzalez (JIRA)

Jorge Luis Betancourt Gonzalez created NUTCH-2036:
-

 Summary: Adding some continuous crawl goodies to the crawl script
 Key: NUTCH-2036
 URL: https://issues.apache.org/jira/browse/NUTCH-2036
 Project: Nutch
  Issue Type: Improvement
  Components: bin, tool, util
Affects Versions: 1.10, 1.11
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor


Nutch does not support continuous crawling out of the box. Although this is 
doable using cron, and is sometimes irrelevant given the size of the crawl, 
it's a nice feature to have. 

This patch adds a new option to the {{bin/crawl}} script (-w|--wait) that 
sets a time to wait when the generator returns 0 (that is, when no URLs are 
scheduled for fetching). 

This new parameter has the {{NUMBER\[SUFFIX\]}} format. If no suffix is 
provided, the amount of time is assumed to be in seconds. Valid suffixes 
are: 

s - seconds
m - minutes
h - hours
d - days

If a {{-1}} value is passed to the parameter, or it's not used at all, the 
default behaviour of exiting the script is used.
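
The {{NUMBER\[SUFFIX\]}} rules described above can be sketched as follows. This is a Python illustration of the described behaviour only, not the shell code in the patch; the `parse_wait` name and the choice to return None for "exit the script" are assumptions made for the example.

```python
import re

# Seconds per unit for each suffix listed in the description.
SUFFIX_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_wait(value):
    """Convert a NUMBER[SUFFIX] string into seconds.

    Returns None when the option is unset or is "-1", which stands for the
    default behaviour of exiting instead of waiting.
    """
    if value is None or value == "-1":
        return None
    match = re.fullmatch(r"(\d+)([smhd]?)", value)
    if match is None:
        raise ValueError("expected NUMBER[SUFFIX], got %r" % value)
    number, suffix = match.groups()
    # A missing suffix means the number is already in seconds.
    return int(number) * SUFFIX_SECONDS[suffix or "s"]


if __name__ == "__main__":
    for raw in ("30", "5m", "2h", "1d", "-1"):
        print(raw, "->", parse_wait(raw))
```

Rejecting anything that does not match the pattern keeps a typo such as `5x` from being silently treated as seconds.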





[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script

2015-06-04 Thread Jorge Luis Betancourt Gonzalez (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2036:
--
Attachment: NUTCH-2036.patch

 Adding some continuous crawl goodies to the crawl script
 

 Key: NUTCH-2036
 URL: https://issues.apache.org/jira/browse/NUTCH-2036
 Project: Nutch
  Issue Type: Improvement
  Components: bin, tool, util
Affects Versions: 1.10, 1.11
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
  Labels: crawl, script
 Attachments: NUTCH-2036.patch


 Nutch does not support continuous crawling out of the box. Although this is 
 doable using cron, and is sometimes irrelevant given the size of the crawl, 
 it's a nice feature to have. 
 This patch adds a new option to the {{bin/crawl}} script (-w|--wait) that 
 sets a time to wait when the generator returns 0 (that is, when no URLs are 
 scheduled for fetching). 
 This new parameter has the {{NUMBER\[SUFFIX\]}} format. If no suffix is 
 provided, the amount of time is assumed to be in seconds. Valid suffixes 
 are: 
 s - seconds
 m - minutes
 h - hours
 d - days
 If a {{-1}} value is passed to the parameter, or it's not used at all, the 
 default behaviour of exiting the script is used.


