[jira] [Updated] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1747:
    Fix Version/s: 1.9
    Assignee: Julien Nioche

Use AtomicInteger as semaphore in Fetcher
    Key: NUTCH-1747
    URL: https://issues.apache.org/jira/browse/NUTCH-1747
    Project: Nutch
    Issue Type: Improvement
    Components: fetcher
    Affects Versions: 1.8
    Reporter: Julien Nioche
    Assignee: Julien Nioche
    Priority: Minor
    Fix For: 1.9
    Attachments: NUTCH-1747-trunk.patch

In Fetcher we currently use
Set<FetchItem> inProgress = Collections.synchronizedSet(new HashSet<FetchItem>());
as semaphore within the FetchItemQueues to keep track of the URLs being fetched and prevent threads from pulling from them. It works fine but we could use AtomicIntegers instead as all we need is the counts, not the contents. This change would have little impact on the performance but would make the code a bit cleaner.

--
This message was sent by Atlassian JIRA (v6.2#6252)
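For illustration only, the counting idea proposed in NUTCH-1747 could look roughly like the sketch below. It is not taken from NUTCH-1747-trunk.patch: the class name InProgressCounter, the maxThreads limit and the method names are assumptions made up for this example, and the real FetchItemQueue may be structured differently.

```java
import java.util.concurrent.atomic.AtomicInteger;

/**
 * Hypothetical stand-in for a per-queue "in progress" tracker.
 * Not the actual Nutch FetchItemQueue: it only counts items in flight,
 * which is all the issue says is needed.
 */
final class InProgressCounter {
    private final AtomicInteger inProgress = new AtomicInteger(0);
    private final int maxThreads; // assumed per-queue limit

    InProgressCounter(int maxThreads) {
        this.maxThreads = maxThreads;
    }

    /** Tries to claim a fetch slot; returns false if the queue is saturated. */
    boolean tryAcquire() {
        while (true) {
            int current = inProgress.get();
            if (current >= maxThreads) {
                return false;
            }
            // compareAndSet retries if another thread changed the count meanwhile
            if (inProgress.compareAndSet(current, current + 1)) {
                return true;
            }
        }
    }

    /** Releases a slot once the fetch has finished. */
    void release() {
        inProgress.decrementAndGet();
    }

    /** Current number of items being fetched. */
    int inProgress() {
        return inProgress.get();
    }
}
```

Compared with a synchronized set, the counter no longer remembers which items are in flight, so it only fits if nothing else needs to look the items up.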
[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968270#comment-13968270 ]

Markus Jelsma commented on NUTCH-1708:

Yes, that seems reasonable, but we still need to get rid of the repr_url. To me it makes little sense to have such strange behaviour in index-basic.

use same id when indexing and deleting redirects
    Key: NUTCH-1708
    URL: https://issues.apache.org/jira/browse/NUTCH-1708
    Project: Nutch
    Issue Type: Bug
    Components: indexer
    Affects Versions: 1.7
    Reporter: Sebastian Nagel

Redirect targets are indexed using representative URL
* in Fetcher repr URL is determined by URLUtil.chooseRepr() and stored in CrawlDatum (CrawlDb). Repr URL is either source or target URL of the redirect pair.
* NutchField url is filled by basic indexing filter with repr URL
* id field used as unique key is filled from url per solrindex-mapping.xml
Deletion of redirects is done in IndexerMapReduce.reduce() by key which is the URL of the redirect source. If the source URL is chosen as repr URL a redirect target may get erroneously deleted.
Test crawl with seed {{http://wiki.apache.org/nutch}} which redirects to {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates that same URL is deleted and added:
{code}
delete http://wiki.apache.org/nutch
add http://wiki.apache.org/nutch
{code}
[jira] [Resolved] (NUTCH-1731) Better cmd line parsing for NutchServer
[ https://issues.apache.org/jira/browse/NUTCH-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney resolved NUTCH-1731.
    Resolution: Fixed

Committed @revision 1587275 in 2.x HEAD
Thank you [~fjodor.vershinin] :)
I changed some trivial things for logging and for argument params. Tested the patch from different terminals and all is working fine.

Better cmd line parsing for NutchServer
    Key: NUTCH-1731
    URL: https://issues.apache.org/jira/browse/NUTCH-1731
    Project: Nutch
    Issue Type: Improvement
    Components: REST_api
    Affects Versions: 2.2
    Reporter: Lewis John McGibbney
    Assignee: Lewis John McGibbney
    Priority: Minor
    Fix For: 2.3
    Attachments: NUTCH-1731.patch, commandline.patch

We can't currently stop a running server without killing the job via pid or something similar. A simple switch should be added to permit this. All it needs to do is call NutchServer#stop which will check to see if there are running tasks... if not then gracefully shut down the server instance.
[jira] [Created] (NUTCH-1756) Security layer for NutchServer
Lewis John McGibbney created NUTCH-1756:
    Summary: Security layer for NutchServer
    Key: NUTCH-1756
    URL: https://issues.apache.org/jira/browse/NUTCH-1756
    Project: Nutch
    Issue Type: Improvement
    Components: REST_api, web gui
    Reporter: Lewis John McGibbney
    Priority: Critical
    Fix For: 2.4

It will be beneficial to have a security layer for NutchServer once we make improvements upon it. I hope that GSoC goes ahead this year so we can tackle such issues. This issue should implement a standard security layer for REST API calls.
[jira] [Commented] (NUTCH-1756) Security layer for NutchServer
[ https://issues.apache.org/jira/browse/NUTCH-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968650#comment-13968650 ]

Lewis John McGibbney commented on NUTCH-1756:

Hi [~fjodor.vershinin] please see this issue also for GSoC inclusion. Maybe it is something we can work towards :)

Security layer for NutchServer
    Key: NUTCH-1756
    URL: https://issues.apache.org/jira/browse/NUTCH-1756
    Project: Nutch
    Issue Type: Improvement
    Components: REST_api, web gui
    Reporter: Lewis John McGibbney
    Priority: Critical
    Fix For: 2.4

It will be beneficial to have a security layer for NutchServer once we make improvements upon it. I hope that GSoC goes ahead this year so we can tackle such issues. This issue should implement a standard security layer for REST API calls.
[jira] [Commented] (NUTCH-1731) Better cmd line parsing for NutchServer
[ https://issues.apache.org/jira/browse/NUTCH-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968652#comment-13968652 ]

Lewis John McGibbney commented on NUTCH-1731:

[~fjodor.vershinin] can you please provide your wiki username to me via lewismc [at] apache [dot] org and I will add you so we can add documentation for the NutchServer tool. Thank you very much.

Better cmd line parsing for NutchServer
    Key: NUTCH-1731
    URL: https://issues.apache.org/jira/browse/NUTCH-1731
    Project: Nutch
    Issue Type: Improvement
    Components: REST_api
    Affects Versions: 2.2
    Reporter: Lewis John McGibbney
    Assignee: Fjodor Vershinin
    Priority: Minor
    Fix For: 2.3
    Attachments: NUTCH-1731.patch, commandline.patch

We can't currently stop a running server without killing the job via pid or something similar. A simple switch should be added to permit this. All it needs to do is call NutchServer#stop which will check to see if there are running tasks... if not then gracefully shut down the server instance.
[jira] [Commented] (NUTCH-1731) Better cmd line parsing for NutchServer
[ https://issues.apache.org/jira/browse/NUTCH-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968665#comment-13968665 ]

Hudson commented on NUTCH-1731:

SUCCESS: Integrated in Nutch-nutchgora #987 (See [https://builds.apache.org/job/Nutch-nutchgora/987/])
NUTCH-1731 Better cmd line parsing for NutchServer (lewismc: http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1587275)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/log4j.properties
* /nutch/branches/2.x/src/java/org/apache/nutch/api/AdminResource.java
* /nutch/branches/2.x/src/java/org/apache/nutch/api/JobManager.java
* /nutch/branches/2.x/src/java/org/apache/nutch/api/NutchApp.java
* /nutch/branches/2.x/src/java/org/apache/nutch/api/NutchServer.java
* /nutch/branches/2.x/src/java/org/apache/nutch/api/Params.java

Better cmd line parsing for NutchServer
    Key: NUTCH-1731
    URL: https://issues.apache.org/jira/browse/NUTCH-1731
    Project: Nutch
    Issue Type: Improvement
    Components: REST_api
    Affects Versions: 2.2
    Reporter: Lewis John McGibbney
    Assignee: Fjodor Vershinin
    Priority: Minor
    Fix For: 2.3
    Attachments: NUTCH-1731.patch, commandline.patch

We can't currently stop a running server without killing the job via pid or something similar. A simple switch should be added to permit this. All it needs to do is call NutchServer#stop which will check to see if there are running tasks... if not then gracefully shut down the server instance.
[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects
[ https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968762#comment-13968762 ]

Sebastian Nagel commented on NUTCH-1708:

??need to get rid of the repr_url??
Not necessarily:
# if we use for field 'id' the URL by which a document has been accessed (with any possible status), everything (indexing, updating, deletion) should work -- those IDs are in sync with CrawlDb and may never appear twice.
# then we are free to fill the field 'url' with a more pretty thing: repr URL (usually shorter), punycoded (without ugly {{xn--}}), showing letters instead of percent-encoded sequences, etc. Since field 'url' is tokenized, decoding the content makes more sense. If in doubt, we could make it configurable which of these denormalization steps are applied.
# finally, we achieve the same behaviour in 1.x and 2.x

use same id when indexing and deleting redirects
    Key: NUTCH-1708
    URL: https://issues.apache.org/jira/browse/NUTCH-1708
    Project: Nutch
    Issue Type: Bug
    Components: indexer
    Affects Versions: 1.7
    Reporter: Sebastian Nagel

Redirect targets are indexed using representative URL
* in Fetcher repr URL is determined by URLUtil.chooseRepr() and stored in CrawlDatum (CrawlDb). Repr URL is either source or target URL of the redirect pair.
* NutchField url is filled by basic indexing filter with repr URL
* id field used as unique key is filled from url per solrindex-mapping.xml
Deletion of redirects is done in IndexerMapReduce.reduce() by key which is the URL of the redirect source. If the source URL is chosen as repr URL a redirect target may get erroneously deleted.
Test crawl with seed {{http://wiki.apache.org/nutch}} which redirects to {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates that same URL is deleted and added:
{code}
delete http://wiki.apache.org/nutch
add http://wiki.apache.org/nutch
{code}
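As a rough illustration of the id collision discussed in NUTCH-1708, the sketch below replays the redirect from the test crawl against a toy in-memory index. Everything here is hypothetical: RedirectIdSketch is not a Nutch class, the map stands in for the real index, the source URL is assumed to have been chosen as repr URL, and the add is assumed to reach the index before the delete.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy index keyed by document id; a hypothetical stand-in, not a Nutch class. */
final class RedirectIdSketch {
    // id -> value of the 'url' field
    private final Map<String, String> index = new LinkedHashMap<>();

    void add(String id, String urlField) {
        index.put(id, urlField);
    }

    void delete(String id) {
        index.remove(id);
    }

    /**
     * Indexes the redirect target, then deletes the redirect source by its key.
     * Returns how many documents remain in the toy index.
     *
     * @param idFromReprUrl true: id is taken from the repr URL (behaviour described in the issue);
     *                      false: id is the URL the document was accessed by (proposal in the comment)
     */
    static int remainingAfterRedirect(boolean idFromReprUrl) {
        String source = "http://wiki.apache.org/nutch";
        String target = "http://wiki.apache.org/nutch/";
        String repr = source; // assumption: the source URL was chosen as repr URL

        RedirectIdSketch sketch = new RedirectIdSketch();
        String id = idFromReprUrl ? repr : target;
        sketch.add(id, repr);  // the 'url' field may still show the prettier repr URL
        sketch.delete(source); // the redirect source is deleted by its own URL
        return sketch.index.size();
    }
}
```

With the id taken from the repr URL the delete removes the freshly indexed target; with the id taken from the accessed URL the two keys differ and the target survives, while the 'url' field can still carry the repr URL.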
[jira] [Commented] (NUTCH-1748) urlfilter-validator to allow .. (two dots) inside file names (path elements)
[ https://issues.apache.org/jira/browse/NUTCH-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968882#comment-13968882 ]

Sebastian Nagel commented on NUTCH-1748:

Hi [~Sertac Turkel],
thanks, +1 for the unit tests.
I'm not sure about the original intention of urlfilter-validator (and its source [commons' UrlValidator|http://commons.apache.org/proper/commons-validator/javadocs/api-1.4.0/org/apache/commons/validator/routines/UrlValidator.html]): it's not the exclusion of URLs containing dot elements in the path (sorry, I've been wrong). Otherwise, counting .. and slashes in the path and comparing their numbers is rather naive and does not check anything in a systematic way:
{code}
assertNotNull(url_validator.filter("http://alfa.bravo.pi/a/../..")); // fails
assertNotNull(url_validator.filter("http://alfa.bravo.pi/a/./././../..")); // succeeds!
{code}
Maybe the intention was to exclude paths which go beyond the server root if there are too many .. elements. But behaviour is explicitly defined in [RFC3986 remove_dot_segments|http://tools.ietf.org/html/rfc3986#section-5.2.4] and modern browsers resolve (normalize) such URLs correctly.
In general, it would make sense to reject any URLs containing dot elements or empty elements in the path: The complete path segments '.' and '..' are intended only for use within relative references ([RFC3986|http://tools.ietf.org/html/rfc3986#section-6.2.2.3]). However, this would require some more work.
Comments are welcome about the desired behaviour!

urlfilter-validator to allow .. (two dots) inside file names (path elements)
    Key: NUTCH-1748
    URL: https://issues.apache.org/jira/browse/NUTCH-1748
    Project: Nutch
    Issue Type: Bug
    Affects Versions: 2.2.1
    Reporter: Sertac TURKEL
    Priority: Minor
    Fix For: 2.3
    Attachments: NUTCH-1748.patch

Unix systems accept file names containing two dots, e.g. abc..xyz.txt. So urlfilter-validator should not reject this kind of URL. Also, paths containing /../ or /.. in final position should still be rejected.
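One way to picture the stricter check suggested in the comment (reject complete '.' and '..' path segments, but accept dots inside a file name) is the sketch below. It is not the code from NUTCH-1748.patch and not the commons UrlValidator logic; DotSegmentCheck and hasDotSegment are names invented for this example, and empty path elements are deliberately left out.

```java
import java.net.URI;

/** Hypothetical helper, not part of urlfilter-validator. */
final class DotSegmentCheck {

    /** Returns true if any complete path segment of the URL is "." or "..". */
    static boolean hasDotSegment(String url) {
        String path = URI.create(url).getRawPath();
        if (path == null) {
            return false; // opaque URIs have no path
        }
        // limit -1 keeps trailing empty segments, so "/a/../" is still split fully
        for (String segment : path.split("/", -1)) {
            if (segment.equals(".") || segment.equals("..")) {
                return true;
            }
        }
        return false;
    }
}
```

Because segments are compared as a whole, a name such as abc..xyz.txt passes, while /a/../.. and /a/./././../.. are both flagged, which avoids the inconsistency of counting dots and slashes.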