[jira] [Commented] (NUTCH-2027) seed list REST endpoint for Nutch 1.10
[ https://issues.apache.org/jira/browse/NUTCH-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573984#comment-14573984 ]

Chris A. Mattmann commented on NUTCH-2027:
------------------------------------------

Committed, thanks Asitang!

{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m "Fix for NUTCH-2027 seed list REST endpoint for Nutch 1.10 contributed by Asitang Mishra asit...@gmail.com this closes #28."
Sending        CHANGES.txt
Sending        src/java/org/apache/nutch/service/NutchServer.java
Adding         src/java/org/apache/nutch/service/model/request/SeedList.java
Adding         src/java/org/apache/nutch/service/resources/SeedResource.java
Transmitting file data
Committed revision 1683657.
[chipotle:~/tmp/nutch-trunk] mattmann%
{noformat}

seed list REST endpoint for Nutch 1.10
--------------------------------------

Key: NUTCH-2027
URL: https://issues.apache.org/jira/browse/NUTCH-2027
Project: Nutch
Issue Type: New Feature
Components: REST_api
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
Labels: memex, rest_api
Fix For: 1.11

The endpoint for Nutch 1.10 that enables the user to set the seedlist for the REST api with a REST call.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (NUTCH-2027) seed list REST endpoint for Nutch 1.10
[ https://issues.apache.org/jira/browse/NUTCH-2027?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573895#comment-14573895 ]

Asitang Mishra commented on NUTCH-2027:
---------------------------------------

Done

seed list REST endpoint for Nutch 1.10
--------------------------------------

Key: NUTCH-2027
URL: https://issues.apache.org/jira/browse/NUTCH-2027
Project: Nutch
Issue Type: New Feature
Components: REST_api
Reporter: Asitang Mishra
Assignee: Chris A. Mattmann
Labels: memex, rest_api
Fix For: 1.11

The endpoint for Nutch 1.10 that enables the user to set the seedlist for the REST api with a REST call.
[jira] [Commented] (NUTCH-2032) Plugin to index the raw content of a readable document.
[ https://issues.apache.org/jira/browse/NUTCH-2032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573148#comment-14573148 ]

Luis Lopez commented on NUTCH-2032:
-----------------------------------

Hi [~wastl-nagel], could you elaborate on what seems favourable? Yes, this will increase the size of the segments, which is non-trivial. I think that this plugin approach is less intrusive with respect to the current class signatures. It works well for our use case, in which we don't need the segments once they are indexed.

Plugin to index the raw content of a readable document.
-------------------------------------------------------

Key: NUTCH-2032
URL: https://issues.apache.org/jira/browse/NUTCH-2032
Project: Nutch
Issue Type: New Feature
Components: indexer, parser
Affects Versions: 1.10
Reporter: Luis Lopez
Labels: content, index, index-rawcontent, parser, raw
Fix For: 1.11

This is related to https://issues.apache.org/jira/browse/NUTCH-1785 and https://issues.apache.org/jira/browse/NUTCH-1458.

We created a couple of plugins to index the raw content of readable documents. If we include these plugins in the plugin chain, we'll index the raw content of a readable document, i.e. XML, HTML, CSV, TXT etc. The index-rawcontent plugin is not designed to index binary files; however, having the full content of an HTML/XML or a CSV document is really critical for some of us.
[jira] [Commented] (NUTCH-2034) CrawlDB filtered documents counter.
[ https://issues.apache.org/jira/browse/NUTCH-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573114#comment-14573114 ]

Luis Lopez commented on NUTCH-2034:
-----------------------------------

Yes, we can use a general counter and report that, or we could be even more specific and count by filter.

CrawlDB filtered documents counter.
-----------------------------------

Key: NUTCH-2034
URL: https://issues.apache.org/jira/browse/NUTCH-2034
Project: Nutch
Issue Type: Improvement
Components: crawldb
Affects Versions: 1.10
Reporter: Luis Lopez
Priority: Minor
Labels: counters, crawldb, filter, info, regex
Fix For: 1.11

When we are doing big crawls, we would like to know how many of the URLs are being discarded by the regex filters. This is only presented in the Injector class:

Injector: Total number of urls rejected by filters: 0

It would be nice to have a counter in the CrawlDB class so we know in every round how many were discarded by our filters:

CrawlDb update: Total number of URLs filtered by regex filters: 31415
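To illustrate the "count by filter" idea from the comment above, here is a minimal Python sketch. It is not Nutch code: the function name `filter_urls`, the filter names, and the sample rules are all hypothetical, and a real implementation would presumably use Hadoop counters inside the CrawlDb job rather than an in-memory `Counter`.

```python
import re
from collections import Counter


def filter_urls(urls, filters):
    """Drop URLs matched by any reject rule and count rejections per rule.

    `filters` maps a (hypothetical) filter name to a compiled regex;
    a URL is attributed to the first rule that matches it.
    """
    counts = Counter()
    kept = []
    for url in urls:
        for name, rule in filters.items():
            if rule.search(url):
                counts[name] += 1
                break
        else:
            # No rule matched, so the URL passes the filters.
            kept.append(url)
    return kept, counts


if __name__ == "__main__":
    # Hypothetical rules, for illustration only.
    sample_filters = {
        "suffix": re.compile(r"(?i)\.(exe|jpg)$"),
        "query": re.compile(r"[?*!@=]"),
    }
    sample_urls = [
        "http://example.com/",
        "http://example.com/a.JPG",
        "http://example.com/?q=1",
    ]
    kept, counts = filter_urls(sample_urls, sample_filters)
    print("Total number of URLs filtered:", sum(counts.values()))
    for name, n in counts.items():
        print(f"  filtered by {name}: {n}")
```

Summing the per-filter counts gives the single general counter; keeping them separate gives the more specific breakdown.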
[jira] [Commented] (NUTCH-2035) Regex filter using case sensitive rules.
[ https://issues.apache.org/jira/browse/NUTCH-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573108#comment-14573108 ]

Luis Lopez commented on NUTCH-2035:
-----------------------------------

Hi [~wastl-nagel], after doing several tests with 100k URLs on my laptop (Core i7 4650U, 8GB RAM, OpenJDK 1.7.0_79) I can confirm that you are right: the performance improvement is negligible. On average I got 19.91s vs 19.54s for url.toLowerCase(). However, I got a better improvement (18.42s) by shortening the rules and writing them to be case insensitive with (?i).

We are using different regex rules. I'm attaching one of our files.

In any case, I think that the rules should be simplified in the default Nutch regex file. Also, in the tests we can use lower case or case-insensitive rules instead of, e.g., MY.DOMAIN.NAME.

Regex filter using case sensitive rules.
----------------------------------------

Key: NUTCH-2035
URL: https://issues.apache.org/jira/browse/NUTCH-2035
Project: Nutch
Issue Type: Improvement
Components: plugin
Affects Versions: 1.10
Reporter: Luis Lopez
Priority: Minor
Labels: filters, regex, regex-urlfilter
Fix For: 1.11
Attachments: regex-urlfilter.txt

Regular expressions are computationally expensive, and having “EXE|exe|JPG|jpg” etc. adds up if we use complex rules. The regex filter should use case-insensitive rules to make the rules more readable and improve performance.
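As a small illustration of the rule shortening discussed in this issue, the sketch below compares a rule that spells out both cases with one that uses the inline (?i) flag. It is written in Python rather than the Java regex engine Nutch uses (both accept a leading (?i)), and the two patterns and the helper `is_rejected` are hypothetical examples, not rules taken from the attached regex-urlfilter.txt.

```python
import re

# Hypothetical rule listing both cases explicitly, in the style the issue describes.
explicit = re.compile(r"\.(EXE|exe|JPG|jpg)$")

# Shorter rule using the inline case-insensitive flag.
insensitive = re.compile(r"(?i)\.(exe|jpg)$")


def is_rejected(url, rule):
    """Return True if the reject rule matches anywhere in the URL."""
    return rule.search(url) is not None


if __name__ == "__main__":
    for url in ("http://example.com/a.JPG", "http://example.com/a.Jpg"):
        print(url, is_rejected(url, explicit), is_rejected(url, insensitive))
```

Note that the two rules are not strictly equivalent: the (?i) version also rejects mixed-case suffixes such as `.Jpg`, which the explicit alternation lets through.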
[jira] [Updated] (NUTCH-2035) Regex filter using case sensitive rules.
[ https://issues.apache.org/jira/browse/NUTCH-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Luis Lopez updated NUTCH-2035:
------------------------------

Attachment: regex-urlfilter.txt

Regex filter using case sensitive rules.
----------------------------------------

Key: NUTCH-2035
URL: https://issues.apache.org/jira/browse/NUTCH-2035
Project: Nutch
Issue Type: Improvement
Components: plugin
Affects Versions: 1.10
Reporter: Luis Lopez
Priority: Minor
Labels: filters, regex, regex-urlfilter
Fix For: 1.11
Attachments: regex-urlfilter.txt

Regular expressions are computationally expensive, and having “EXE|exe|JPG|jpg” etc. adds up if we use complex rules. The regex filter should use case-insensitive rules to make the rules more readable and improve performance.
[jira] [Created] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script
Jorge Luis Betancourt Gonzalez created NUTCH-2036:
--------------------------------------------------

Summary: Adding some continuous crawl goodies to the crawl script
Key: NUTCH-2036
URL: https://issues.apache.org/jira/browse/NUTCH-2036
Project: Nutch
Issue Type: Improvement
Components: bin, tool, util
Affects Versions: 1.10, 1.11
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor

Nutch does not support continuous crawling out of the box. Yes, this is somewhat doable using cron, and sometimes even irrelevant due to the size of the crawl, but it's a nice feature to have.

This patch basically just adds a new parameter option to the {{bin/crawl}} script (-w|--wait) which adds a time to wait if the generator returns 0 (when no URLs are scheduled for fetching). This new parameter has the {{NUMBER\[SUFFIX\]}} format; if no suffix is provided, the amount of time is assumed to be in seconds. Other valid suffixes are:

s - seconds
m - minutes
h - hours
d - days

If a {{-1}} value is passed to the parameter, or it's not used at all, the default behaviour of exiting the script is used.
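The {{NUMBER\[SUFFIX\]}} format described above can be sketched as follows. This is only an illustration in Python; the actual patch modifies the {{bin/crawl}} shell script, and the function name `parse_wait` and its error handling are assumptions, not taken from NUTCH-2036.patch.

```python
import re

# Seconds per unit for the suffixes listed in the issue (s, m, h, d).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}


def parse_wait(value):
    """Convert a NUMBER[SUFFIX] wait value to seconds.

    Returns None when the option is absent or set to -1, which stands for
    the default behaviour of exiting the script.
    """
    if value is None or value == "-1":
        return None
    match = re.fullmatch(r"(\d+)([smhd]?)", value)
    if match is None:
        raise ValueError(f"invalid wait value: {value!r}")
    number, suffix = match.groups()
    # A missing suffix means the number is already in seconds.
    return int(number) * _UNIT_SECONDS[suffix or "s"]


if __name__ == "__main__":
    for raw in ("30", "5m", "2h", "1d", "-1"):
        print(raw, "->", parse_wait(raw))
```

A crawl loop could then sleep for the returned number of seconds whenever the generator reports no URLs to fetch, and exit when the result is None.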
[jira] [Updated] (NUTCH-2036) Adding some continuous crawl goodies to the crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2036:
-------------------------------------------------

Attachment: NUTCH-2036.patch

Adding some continuous crawl goodies to the crawl script
--------------------------------------------------------

Key: NUTCH-2036
URL: https://issues.apache.org/jira/browse/NUTCH-2036
Project: Nutch
Issue Type: Improvement
Components: bin, tool, util
Affects Versions: 1.10, 1.11
Reporter: Jorge Luis Betancourt Gonzalez
Priority: Minor
Labels: crawl, script
Attachments: NUTCH-2036.patch

Nutch does not support continuous crawling out of the box. Yes, this is somewhat doable using cron, and sometimes even irrelevant due to the size of the crawl, but it's a nice feature to have.

This patch basically just adds a new parameter option to the {{bin/crawl}} script (-w|--wait) which adds a time to wait if the generator returns 0 (when no URLs are scheduled for fetching). This new parameter has the {{NUMBER\[SUFFIX\]}} format; if no suffix is provided, the amount of time is assumed to be in seconds. Other valid suffixes are:

s - seconds
m - minutes
h - hours
d - days

If a {{-1}} value is passed to the parameter, or it's not used at all, the default behaviour of exiting the script is used.