[jira] [Updated] (NUTCH-1747) Use AtomicInteger as semaphore in Fetcher

2014-04-14 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1747?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1747:
-

Fix Version/s: 1.9
 Assignee: Julien Nioche

 Use AtomicInteger as semaphore in Fetcher
 -

 Key: NUTCH-1747
 URL: https://issues.apache.org/jira/browse/NUTCH-1747
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 1.8
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
 Fix For: 1.9

 Attachments: NUTCH-1747-trunk.patch


 In Fetcher we currently use 
 Set<FetchItem> inProgress = Collections.synchronizedSet(new 
 HashSet<FetchItem>());
 as semaphore within the FetchItemQueues to keep track of the URLs being 
 fetched and prevent threads from pulling from them. It works fine but we 
 could use AtomicIntegers instead as all we need is the counts, not the 
 contents.
 This change would have little impact on the performance but would make the 
 code a bit cleaner.
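
A minimal sketch of the idea, not the attached patch: a queue only needs to know how many of its items are in flight, so a counter can stand in for the synchronized set. The class name InProgressCounter, the maxThreads limit, and the method names are assumptions made for illustration; the actual FetchItemQueue code may differ.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the per-queue bookkeeping; not the real Nutch class. */
final class InProgressCounter {
  private final AtomicInteger inProgress = new AtomicInteger(0);
  private final int maxThreads; // assumed per-queue limit

  InProgressCounter(int maxThreads) {
    this.maxThreads = maxThreads;
  }

  /** Reserves a slot if the queue is below its limit; returns false otherwise. */
  boolean tryAcquire() {
    while (true) {
      int current = inProgress.get();
      if (current >= maxThreads) {
        return false;
      }
      // compareAndSet fails if another thread changed the count; retry then.
      if (inProgress.compareAndSet(current, current + 1)) {
        return true;
      }
    }
  }

  /** Marks one item as finished. */
  void release() {
    inProgress.decrementAndGet();
  }

  int inProgressCount() {
    return inProgress.get();
  }

  public static void main(String[] args) {
    InProgressCounter queue = new InProgressCounter(2);
    System.out.println(queue.tryAcquire()); // true
    System.out.println(queue.tryAcquire()); // true
    System.out.println(queue.tryAcquire()); // false, limit reached
    queue.release();
    System.out.println(queue.inProgressCount()); // 1
  }
}
```

The compare-and-set loop keeps the check and the increment atomic, which a plain get() followed by incrementAndGet() would not.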



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-04-14 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968270#comment-13968270
 ] 

Markus Jelsma commented on NUTCH-1708:
--

Yes, that seems reasonable, but we still need to get rid of the repr_url. To me 
it makes little sense to have such strange behaviour in index-basic.

 use same id when indexing and deleting redirects
 

 Key: NUTCH-1708
 URL: https://issues.apache.org/jira/browse/NUTCH-1708
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 1.7
Reporter: Sebastian Nagel

 Redirect targets are indexed using representative URL
 * in Fetcher repr URL is determined by URLUtil.chooseRepr() and stored in 
 CrawlDatum (CrawlDb). Repr URL is either source or target URL of the redirect 
 pair.
 * NutchField url is filled by basic indexing filter with repr URL
 * id field used as unique key is filled from url per solrindex-mapping.xml
 Deletion of redirects is done in IndexerMapReduce.reduce() by key which is 
 the URL of the redirect source. If the source URL is chosen as repr URL a 
 redirect target may get erroneously deleted.
 Test crawl with seed {{http://wiki.apache.org/nutch}} which redirects to 
 {{http://wiki.apache.org/nutch/}}. DummyIndexWriter (NUTCH-1707) indicates 
 that same URL is deleted and added:
 {code}
 delete  http://wiki.apache.org/nutch
 add http://wiki.apache.org/nutch
 {code}





[jira] [Resolved] (NUTCH-1731) Better cmd line parsing for NutchServer

2014-04-14 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1731.
-

Resolution: Fixed

Committed @revision 1587275 in 2.x HEAD
Thank you [~fjodor.vershinin] :)
I changed some trivial things for logging and for argument params. Tested patch 
from different terminals and all working fine. 

 Better cmd line parsing for NutchServer
 ---

 Key: NUTCH-1731
 URL: https://issues.apache.org/jira/browse/NUTCH-1731
 Project: Nutch
  Issue Type: Improvement
  Components: REST_api
Affects Versions: 2.2
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1731.patch, commandline.patch


 We can't currently stop a running server without killing the job via pid or 
 something similar.
 A simple switch should be added to permit this.
 All it needs to do is call NutchServer#stop which will check to see if there 
 are running tasks... if not then gracefully shut down the server instance. 
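
The stop behaviour described above could look roughly like the following sketch. StoppableServer and all of its members are hypothetical names used only for illustration; this is not the code in org.apache.nutch.api.NutchServer or in the attached patches.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical server stand-in; not the real NutchServer. */
final class StoppableServer {
  private final AtomicInteger runningTasks = new AtomicInteger(0);
  private volatile boolean running = true;

  void taskStarted() {
    runningTasks.incrementAndGet();
  }

  void taskFinished() {
    runningTasks.decrementAndGet();
  }

  boolean isRunning() {
    return running;
  }

  /**
   * Shuts down only when no tasks are running.
   * Returns true if the server was stopped, false if tasks are still active.
   */
  boolean stop() {
    if (runningTasks.get() > 0) {
      return false;
    }
    running = false;
    return true;
  }

  public static void main(String[] args) {
    StoppableServer server = new StoppableServer();
    server.taskStarted();
    System.out.println(server.stop()); // false, one task is still running
    server.taskFinished();
    System.out.println(server.stop()); // true, server is now stopped
  }
}
```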





[jira] [Created] (NUTCH-1756) Security layer for NutchServer

2014-04-14 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1756:
---

 Summary: Security layer for NutchServer
 Key: NUTCH-1756
 URL: https://issues.apache.org/jira/browse/NUTCH-1756
 Project: Nutch
  Issue Type: Improvement
  Components: REST_api, web gui
Reporter: Lewis John McGibbney
Priority: Critical
 Fix For: 2.4


It will be beneficial to have a security layer for NutchServer once we make 
improvements upon it. I hope that GSoC goes ahead this year so we can tackle 
such issues.
This issue should implement a standard security layer for REST API calls.





[jira] [Commented] (NUTCH-1756) Security layer for NutchServer

2014-04-14 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968650#comment-13968650
 ] 

Lewis John McGibbney commented on NUTCH-1756:
-

Hi [~fjodor.vershinin] please see this issue also for GSoC inclusion. Maybe it 
is something we can work towards :)






[jira] [Commented] (NUTCH-1731) Better cmd line parsing for NutchServer

2014-04-14 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968652#comment-13968652
 ] 

Lewis John McGibbney commented on NUTCH-1731:
-

[~fjodor.vershinin] can you please provide your wiki username to me via lewismc 
[at] apache [dot] org and I will add you so we can add documentation for the 
NutchServer tool. Thank you v much.

 Better cmd line parsing for NutchServer
 ---

 Key: NUTCH-1731
 URL: https://issues.apache.org/jira/browse/NUTCH-1731
 Project: Nutch
  Issue Type: Improvement
  Components: REST_api
Affects Versions: 2.2
Reporter: Lewis John McGibbney
Assignee: Fjodor Vershinin
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1731.patch, commandline.patch


 We can't currently stop a running server without killing the job via pid or 
 something similar.
 A simple switch should be added to permit this.
 All it needs to do is call NutchServer#stop which will check to see if there 
 are running tasks... if not then gracefully shut down the server instance. 





[jira] [Commented] (NUTCH-1731) Better cmd line parsing for NutchServer

2014-04-14 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968665#comment-13968665
 ] 

Hudson commented on NUTCH-1731:
---

SUCCESS: Integrated in Nutch-nutchgora #987 (See 
[https://builds.apache.org/job/Nutch-nutchgora/987/])
NUTCH-1731 Better cmd line parsing for NutchServer (lewismc: 
http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1587275)
* /nutch/branches/2.x/CHANGES.txt
* /nutch/branches/2.x/conf/log4j.properties
* /nutch/branches/2.x/src/java/org/apache/nutch/api/AdminResource.java
* /nutch/branches/2.x/src/java/org/apache/nutch/api/JobManager.java
* /nutch/branches/2.x/src/java/org/apache/nutch/api/NutchApp.java
* /nutch/branches/2.x/src/java/org/apache/nutch/api/NutchServer.java
* /nutch/branches/2.x/src/java/org/apache/nutch/api/Params.java







[jira] [Commented] (NUTCH-1708) use same id when indexing and deleting redirects

2014-04-14 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968762#comment-13968762
 ] 

Sebastian Nagel commented on NUTCH-1708:


??need to get rid of the repr_url??
Not necessarily:
# if we use for field 'id' the URL under which a document has been accessed 
(with any possible status), everything (indexing, updating, deletion) should 
work -- those IDs are in sync with CrawlDb and can never appear twice.
# then we are free to fill the field 'url' with something prettier: the repr 
URL (usually shorter), Punycode-decoded (without the ugly {{xn--}}), showing 
letters instead of percent-encoded sequences, etc. Since field 'url' is 
tokenized, decoding the content makes more sense. If in doubt, we could make 
it configurable which of these denormalization steps are applied.
# finally, we achieve the same behaviour in 1.x and 2.x
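
To illustrate the first two points, here is a small sketch that models the index as a map from id to the 'url' field. It is a toy model, not Nutch code: the class and method names are made up, and the order of operations (add the target, then delete the source) is an assumption chosen to show the collision.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the id collision; names are hypothetical, not Nutch classes. */
final class RedirectIdSketch {

  /**
   * Adds the redirect target to a fresh index, then deletes the redirect
   * source by its CrawlDb key. Returns true if the target document survives.
   *
   * @param idFromRepr true: id is taken from the repr URL (reported behaviour);
   *                   false: id is the URL the document was accessed under.
   */
  static boolean targetSurvives(String source, String target, String repr,
      boolean idFromRepr) {
    Map<String, String> index = new HashMap<>(); // id -> 'url' field
    String id = idFromRepr ? repr : target;
    index.put(id, repr);   // 'url' field may still hold the prettier repr URL
    index.remove(source);  // deletion of the redirect source by key
    return index.containsKey(id);
  }

  public static void main(String[] args) {
    String source = "http://wiki.apache.org/nutch";
    String target = "http://wiki.apache.org/nutch/";
    // Source URL chosen as repr URL, as in the test crawl from the issue.
    System.out.println(targetSurvives(source, target, source, true));  // false
    System.out.println(targetSurvives(source, target, source, false)); // true
  }
}
```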






[jira] [Commented] (NUTCH-1748) urlfilter-validator to allow .. (two dots) inside file names (path elements)

2014-04-14 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13968882#comment-13968882
 ] 

Sebastian Nagel commented on NUTCH-1748:


Hi [~Sertac Turkel], thanks, +1 for the unit tests.
I'm not sure about the original intention of urlfilter-validator (and its 
source [commons' 
UrlValidator|http://commons.apache.org/proper/commons-validator/javadocs/api-1.4.0/org/apache/commons/validator/routines/UrlValidator.html]):
 it's not the exclusion of URLs containing dot elements in the path (sorry, 
I've been wrong). Otherwise, counting .. and slashes in the path and 
comparing their numbers is rather naive and does not check anything in a 
systematic way:
{code}
assertNotNull(url_validator.filter("http://alfa.bravo.pi/a/../..")); // fails
assertNotNull(url_validator.filter("http://alfa.bravo.pi/a/./././../..")); // succeeds!
{code}
Maybe the intention was to exclude paths which go beyond the server root if 
there are too many .. elements. But behaviour is explicitly defined in 
[RFC3986 remove_dot_segments|http://tools.ietf.org/html/rfc3986#section-5.2.4] 
and modern browsers resolve (normalize) such URLs correctly.
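
For reference, a hand-written sketch of the remove_dot_segments steps from RFC 3986 section 5.2.4. The class and method names are made up for illustration; this is not part of urlfilter-validator or of commons-validator.

```java
/** Sketch of RFC 3986 section 5.2.4; class and method names are hypothetical. */
final class DotSegments {

  static String removeDotSegments(String path) {
    StringBuilder out = new StringBuilder();
    String in = path;
    while (!in.isEmpty()) {
      if (in.startsWith("../")) {            // step A
        in = in.substring(3);
      } else if (in.startsWith("./")) {      // step A
        in = in.substring(2);
      } else if (in.startsWith("/./")) {     // step B
        in = in.substring(2);
      } else if (in.equals("/.")) {          // step B
        in = "/";
      } else if (in.startsWith("/../")) {    // step C
        in = in.substring(3);
        removeLastSegment(out);
      } else if (in.equals("/..")) {         // step C
        in = "/";
        removeLastSegment(out);
      } else if (in.equals(".") || in.equals("..")) { // step D
        in = "";
      } else {                               // step E: move one segment over
        int next = in.indexOf('/', 1);
        if (next < 0) {
          next = in.length();
        }
        out.append(in, 0, next);
        in = in.substring(next);
      }
    }
    return out.toString();
  }

  /** Drops the last segment and its preceding "/" from the output buffer. */
  private static void removeLastSegment(StringBuilder out) {
    int idx = out.lastIndexOf("/");
    out.setLength(idx < 0 ? 0 : idx);
  }

  public static void main(String[] args) {
    System.out.println(removeDotSegments("/a/../.."));         // "/"
    System.out.println(removeDotSegments("/a/./././../.."));   // "/"
    System.out.println(removeDotSegments("/a/b/c/./../../g")); // "/a/g"
    System.out.println(removeDotSegments("/abc..xyz.txt"));    // unchanged
  }
}
```

Under these rules both test URLs from the snippet above resolve to the root path, and a file name that merely contains two dots is left untouched.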

In general, it would make sense to reject any URLs containing dot elements or 
empty elements in the path: The complete path segments '.' and '..' are 
intended only for use within relative references 
([RFC3986|http://tools.ietf.org/html/rfc3986#section-6.2.2.3]). However, this 
would require some more work.

Comments are welcome about the desired behaviour!

 urlfilter-validator to allow .. (two dots) inside file names (path elements)
 

 Key: NUTCH-1748
 URL: https://issues.apache.org/jira/browse/NUTCH-1748
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.2.1
Reporter: Sertac TURKEL
Priority: Minor
 Fix For: 2.3

 Attachments: NUTCH-1748.patch


 Unix systems accept file names containing two dots, e.g. abc..xyz.txt, so 
 urlfilter-validator should not reject this kind of URL. Paths containing 
 /../, or /.. in final position, should still be rejected.


