[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-06-12 Thread Omkar Reddy (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509469#comment-16509469
 ] 

Omkar Reddy commented on NUTCH-2557:


A simple and wise solution. Thanks. 

> protocol-http fails to follow redirections when an HTTP response body is 
> invalid
> 
>
> Key: NUTCH-2557
> URL: https://issues.apache.org/jira/browse/NUTCH-2557
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> If a server sends a redirection (3XX status code with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket as 
> soon as they can.
>  * Example: this page is a redirection to its https version, with an HTTP 
> body containing invalid gzip-encoded content. Browsers follow the 
> redirection, but Nutch throws an error:
>  ** [http://www.webarcelona.net/es/blog?page=2]
>  
> The HttpResponse::getContent method can already return null. I think it should 
> at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers and not even try to 
> parse the body when the headers indicate a redirection.
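The behavior proposed above can be sketched as follows. This is a standalone illustrative example, not Nutch's actual HttpResponse code; the decode stand-in and the exact method shapes are assumptions.

```java
public class RedirectBodySketch {

    // 3XX status codes signal a redirect via the Location header.
    static boolean isRedirect(int statusCode) {
        return statusCode >= 300 && statusCode < 400;
    }

    // Return the decoded body, or null when the response is a redirect or
    // when the body cannot be decoded -- mirroring what browsers do.
    static byte[] getContent(int statusCode, byte[] rawBody) {
        if (isRedirect(statusCode)) {
            return null; // follow Location; do not touch the body at all
        }
        try {
            return decode(rawBody);
        } catch (RuntimeException e) {
            return null; // invalid body: drop it instead of failing the fetch
        }
    }

    // Stand-in for content decoding (e.g. gzip); throws on invalid input.
    static byte[] decode(byte[] raw) {
        if (raw == null) throw new IllegalArgumentException("no body");
        return raw;
    }
}
```

With this shape, a redirect carrying a corrupt gzip body still yields the Location-based redirect instead of an aborted fetch.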



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid

2018-05-25 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490581#comment-16490581
 ] 

Omkar Reddy commented on NUTCH-2557:


I agree: the HTTP body of bad requests and redirects can sometimes contain 
diagnostic information that is helpful to the user, so we should store it 
optionally. 

Can we name the property http.content.store.3XX.404, or is that too 
complicated a name for a property?

> protocol-http fails to follow redirections when an HTTP response body is 
> invalid
> 
>
> Key: NUTCH-2557
> URL: https://issues.apache.org/jira/browse/NUTCH-2557
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Gerard Bouchar
>Priority: Major
>
> If a server sends a redirection (3XX status code with a Location header), 
> protocol-http tries to parse the HTTP response body anyway. Thus, if an error 
> occurs while decoding the body, the redirection is not followed and the 
> information is lost. Browsers follow the redirection and close the socket as 
> soon as they can.
>  * Example: this page is a redirection to its https version, with an HTTP 
> body containing invalid gzip-encoded content. Browsers follow the 
> redirection, but Nutch throws an error:
>  ** [http://www.webarcelona.net/es/blog?page=2]
>  
> The HttpResponse::getContent method can already return null. I think it should 
> at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers and not even try to 
> parse the body when the headers indicate a redirection.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses

2018-05-24 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16488900#comment-16488900
 ] 

Omkar Reddy commented on NUTCH-2575:


I have taken up [NUTCH-2557|https://issues.apache.org/jira/browse/NUTCH-2557] 
and started working on it. Thanks. 

> protocol-http does not respect the maximum content-size for chunked responses
> -
>
> Key: NUTCH-2575
> URL: https://issues.apache.org/jira/browse/NUTCH-2575
> Project: Nutch
>  Issue Type: Sub-task
>  Components: protocol
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Critical
> Fix For: 1.15
>
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from 
> stopping once the content read exceeds the maximum allowed size.
> There [is a variable 
> contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
>  that is used to check how much content has been read, but it is never 
> updated, so it always stays at 0, and [the size 
> check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
>  always returns false (unless a single chunk is larger than the maximum 
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size

2018-05-06 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465037#comment-16465037
 ] 

Omkar Reddy commented on NUTCH-2575:


Hi [~gbouchar], I see the issue: while reading each chunk we compute the 
number of bytes read in that chunk in the variable "chunkBytesRead", but it is 
never added to "contentBytesRead" after the chunk is read. 

A simple fix is to do "contentBytesRead += chunkBytesRead" at the end of every 
chunk. That should fix it. I will send a PR for this. Thanks. 
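A self-contained sketch of that fix (illustrative only; the real readChunkedContent parses chunk headers from a socket stream, which is omitted here): accumulating chunkBytesRead into contentBytesRead is what makes the maximum-size check fire.

```java
import java.io.ByteArrayOutputStream;

public class ChunkedReadSketch {

    // Read chunks until exhausted or until the accumulated size exceeds
    // maxContent, at which point reading stops.
    static byte[] readChunkedContent(byte[][] chunks, int maxContent) {
        ByteArrayOutputStream content = new ByteArrayOutputStream();
        int contentBytesRead = 0;
        for (byte[] chunk : chunks) {
            int chunkBytesRead = chunk.length;
            // The missing accumulation: without this line contentBytesRead
            // stays 0 and the size check below never triggers.
            contentBytesRead += chunkBytesRead;
            if (contentBytesRead > maxContent) {
                break; // stop reading; a server cannot force an OOM on us
            }
            content.write(chunk, 0, chunkBytesRead);
        }
        return content.toByteArray();
    }
}
```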

> protocol-http does not respect the maximum content-size
> ---
>
> Key: NUTCH-2575
> URL: https://issues.apache.org/jira/browse/NUTCH-2575
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Gerard Bouchar
>Priority: Critical
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from 
> stopping once the content read exceeds the maximum allowed size.
> There [is a variable 
> contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
>  that is used to check how much content has been read, but it is never 
> updated, so it always stays at 0, and [the size 
> check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
>  always returns false (unless a single chunk is larger than the maximum 
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2553) Fetcher not to modify URLs to be fetched

2018-04-16 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439219#comment-16439219
 ] 

Omkar Reddy commented on NUTCH-2553:


WOW! :O I couldn't have figured that one out on my own. Thanks [~wastl-nagel]

> Fetcher not to modify URLs to be fetched
> 
>
> Key: NUTCH-2553
> URL: https://issues.apache.org/jira/browse/NUTCH-2553
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.15
>
>
> Fetcher modifies the URLs being fetched (introduced with NUTCH-2375 in 
> [c93d908|https://github.com/apache/nutch/commit/c93d908bb635d3c5b59f8c8a22e0584ebf588794#diff-847479d08597eb30da1c715310438685R253]):
> {noformat}
> FetcherThread 22 fetching http://nutch.apache.org:-1/ (queue crawl 
> delay=5000ms)
> {noformat}
> which makes it hard to trace the URLs in the log files and likely causes 
> other issues because URLs in CrawlDb and segments do not match 
> (http://nutch.apache.org/ in CrawlDb and http://nutch.apache.org:-1/ in 
> segment).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2551) NullPointerException in generator

2018-04-16 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439208#comment-16439208
 ] 

Omkar Reddy commented on NUTCH-2551:


Hello [~wastl-nagel], I used Hadoop 2.7.4; I will try to reproduce it with 
Hadoop 2.8.3 just as an exercise. Thanks for letting me know. I used 2.7.4 
because the Ivy configuration at 
[nutch|https://github.com/apache/nutch]/[ivy|https://github.com/apache/nutch/tree/master/ivy]/*ivy.xml*
 uses that version. 

[~HansBrende] thank you for the explanation and the patch. :) 

> NullPointerException in generator
> -
>
> Key: NUTCH-2551
> URL: https://issues.apache.org/jira/browse/NUTCH-2551
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.15
>Reporter: Hans Brende
>Assignee: Sebastian Nagel
>Priority: Blocker
> Fix For: 1.15
>
>
> A NullPointerException is thrown during the crawl generate stage when I 
> deploy to a hadoop cluster (but for some reason, it works fine locally).
> It looks like this is caused because the URLPartitioner class still has the 
> old {{configure()}} method in there (which is never called, causing the 
> {{normalizers}} field to remain null), rather than implementing the 
> {{Configurable}} interface as detailed in the newer mapreduce API's 
> Partitioner spec.
> Stack trace:
> {code}
> java.lang.NullPointerException
>  at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:76)
>  at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:40)
>  at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:716)
>  at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>  at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>  at 
> org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:553)
>  at 
> org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:546)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> {code}
>  
> Oh and it might also be because a *static* URLPartitioner instance is being 
> used in the Generator.Selector class... but it's only initialized in the 
> {{setup()}} method of the Generator.Selector.SelectorMapper class! So that 
> whole setup looks pretty weird...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2551) NullPointerException in generator

2018-04-11 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433705#comment-16433705
 ] 

Omkar Reddy commented on NUTCH-2551:


[~wastl-nagel], [~HansBrende], [~lewi...@apache.org] please let me know your 
thoughts on my explanation above so that I can send a PR with the fix. Thanks. 

> NullPointerException in generator
> -
>
> Key: NUTCH-2551
> URL: https://issues.apache.org/jira/browse/NUTCH-2551
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.15
>Reporter: Hans Brende
>Priority: Blocker
> Fix For: 1.15
>
>
> A NullPointerException is thrown during the crawl generate stage when I 
> deploy to a hadoop cluster (but for some reason, it works fine locally).
> It looks like this is caused because the URLPartitioner class still has the 
> old {{configure()}} method in there (which is never called, causing the 
> {{normalizers}} field to remain null), rather than implementing the 
> {{Configurable}} interface as detailed in the newer mapreduce API's 
> Partitioner spec.
> Stack trace:
> {code}
> java.lang.NullPointerException
>  at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:76)
>  at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:40)
>  at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:716)
>  at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>  at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>  at 
> org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:553)
>  at 
> org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:546)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> {code}
>  
> Oh and it might also be because a *static* URLPartitioner instance is being 
> used in the Generator.Selector class... but it's only initialized in the 
> {{setup()}} method of the Generator.Selector.SelectorMapper class! So that 
> whole setup looks pretty weird...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2551) NullPointerException in generator

2018-04-10 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431997#comment-16431997
 ] 

Omkar Reddy commented on NUTCH-2551:


Hi [~HansBrende], I tried reproducing the error in pseudo-distributed mode on 
my local machine and was unable to. Are there any specific steps to reproduce 
it, or should I just run a general crawl cycle? Thanks. 

> NullPointerException in generator
> -
>
> Key: NUTCH-2551
> URL: https://issues.apache.org/jira/browse/NUTCH-2551
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.15
>Reporter: Hans Brende
>Priority: Blocker
> Fix For: 1.15
>
>
> A NullPointerException is thrown during the crawl generate stage when I 
> deploy to a hadoop cluster (but for some reason, it works fine locally).
> It looks like this is caused because the URLPartitioner class still has the 
> old {{configure()}} method in there (which is never called, causing the 
> {{normalizers}} field to remain null), rather than implementing the 
> {{Configurable}} interface as detailed in the newer mapreduce API's 
> Partitioner spec.
> Stack trace:
> {code}
> java.lang.NullPointerException
>  at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:76)
>  at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:40)
>  at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:716)
>  at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>  at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>  at 
> org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:553)
>  at 
> org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:546)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> {code}
>  
> Oh and it might also be because a *static* URLPartitioner instance is being 
> used in the Generator.Selector class... but it's only initialized in the 
> {{setup()}} method of the Generator.Selector.SelectorMapper class! So that 
> whole setup looks pretty weird...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2553) Fetcher not to modify URLs to be fetched

2018-04-09 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431801#comment-16431801
 ] 

Omkar Reddy commented on NUTCH-2553:


[~wastl-nagel] I did not deliberately add anything that produces this specific 
change; it is likely a side effect of an implementation change I made in 
NUTCH-2375. I will find the root cause and let you know.

Thanks. 

> Fetcher not to modify URLs to be fetched
> 
>
> Key: NUTCH-2553
> URL: https://issues.apache.org/jira/browse/NUTCH-2553
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.15
>
>
> Fetcher modifies the URLs being fetched (introduced with NUTCH-2375 in 
> [c93d908|https://github.com/apache/nutch/commit/c93d908bb635d3c5b59f8c8a22e0584ebf588794#diff-847479d08597eb30da1c715310438685R253]):
> {noformat}
> FetcherThread 22 fetching http://nutch.apache.org:-1/ (queue crawl 
> delay=5000ms)
> {noformat}
> which makes it hard to trace the URLs in the log files and likely causes 
> other issues because URLs in CrawlDb and segments do not match 
> (http://nutch.apache.org/ in CrawlDb and http://nutch.apache.org:-1/ in 
> segment).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2551) NullPointerException in generator

2018-04-09 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431798#comment-16431798
 ] 

Omkar Reddy commented on NUTCH-2551:


I think the issue here is that a new job (via Job.getInstance) is created in 
the setup() of GeneratorSelectorMapper, and that job is passed when we 
configure the partitioner. This is probably why the configuration is lost, 
hence the NullPointerException.

I don't know why I created a new job in that patch (NUTCH-2375) rather than 
just passing the configuration object to the URLPartitioner.configure() 
method; my bad. This is a quick fix and I will send a PR. Thanks. 
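The failure mode and the fix can be sketched with plain stand-ins for Hadoop's Configuration and the partitioner (names are illustrative, not Nutch's exact code): configuring the partitioner from a freshly created job loses the live settings, while passing the mapper's own configuration initializes it.

```java
import java.util.HashMap;
import java.util.Map;

public class PartitionerConfigSketch {

    // Minimal stand-in for Hadoop's Configuration.
    static class Configuration {
        private final Map<String, String> props = new HashMap<>();
        void set(String key, String value) { props.put(key, value); }
        String get(String key) { return props.get(key); }
    }

    // Minimal stand-in for URLPartitioner: 'mode' stays null unless
    // configure() receives the configuration that actually holds it, which
    // is what leads to the NullPointerException in getPartition().
    static class URLPartitioner {
        String mode;
        void configure(Configuration conf) { mode = conf.get("partition.url.mode"); }
        boolean isConfigured() { return mode != null; }
    }

    // Buggy pattern: a brand-new configuration (as obtained from a freshly
    // created job) does not contain the running job's settings.
    static URLPartitioner setupWithFreshJob() {
        URLPartitioner p = new URLPartitioner();
        p.configure(new Configuration());
        return p;
    }

    // Fixed pattern: pass the mapper's own configuration object.
    static URLPartitioner setupWithJobConf(Configuration jobConf) {
        URLPartitioner p = new URLPartitioner();
        p.configure(jobConf);
        return p;
    }
}
```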

> NullPointerException in generator
> -
>
> Key: NUTCH-2551
> URL: https://issues.apache.org/jira/browse/NUTCH-2551
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.15
>Reporter: Hans Brende
>Priority: Blocker
> Fix For: 1.15
>
>
> A NullPointerException is thrown during the crawl generate stage when I 
> deploy to a hadoop cluster (but for some reason, it works fine locally).
> It looks like this is caused because the URLPartitioner class still has the 
> old {{configure()}} method in there (which is never called, causing the 
> {{normalizers}} field to remain null), rather than implementing the 
> {{Configurable}} interface as detailed in the newer mapreduce API's 
> Partitioner spec.
> Stack trace:
> {code}
> java.lang.NullPointerException
>  at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:76)
>  at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:40)
>  at 
> org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:716)
>  at 
> org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
>  at 
> org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
>  at 
> org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:553)
>  at 
> org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:546)
>  at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>  at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>  at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>  at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> {code}
>  
> Oh and it might also be because a *static* URLPartitioner instance is being 
> used in the Generator.Selector class... but it's only initialized in the 
> {{setup()}} method of the Generator.Selector.SelectorMapper class! So that 
> whole setup looks pretty weird...



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2518) Must check return value of job.waitForCompletion()

2018-03-27 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415426#comment-16415426
 ] 

Omkar Reddy commented on NUTCH-2518:


I might have missed this ticket.

Hi [~wastl-nagel], this was not covered in my PR for 
[NUTCH-2375|https://github.com/apache/nutch/pull/221]. 

[~kpm1985], [~wastl-nagel], [~lewismc] I see there is a PR with just a minor 
change for this issue. I can take it up if that is not a problem. Please let 
me know either way.

Thanks.    

> Must check return value of job.waitForCompletion()
> --
>
> Key: NUTCH-2518
> URL: https://issues.apache.org/jira/browse/NUTCH-2518
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb, fetcher, generator, hostdb, linkdb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Kenneth McFarland
>Priority: Blocker
> Fix For: 1.15
>
>
> The return value of job.waitForCompletion() of the new MapReduce API 
> (NUTCH-2375) must always be checked. If it's not true, the job has been 
> failed or killed. Accordingly, the program
> - should not proceed with further jobs/steps
> - must clean-up temporary data, unlock CrawlDB, etc.
> - exit with non-zero exit value, so that scripts running the crawl workflow 
> can handle the failure
> Cf. NUTCH-2076, NUTCH-2442, [NUTCH-2375 PR 
> #221|https://github.com/apache/nutch/pull/221#issuecomment-332941883].
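The three requirements above can be sketched as a small control-flow helper (illustrative only; the job and cleanup hooks are stand-ins, not Nutch's actual methods):

```java
public class JobExitSketch {

    // Simulates job.waitForCompletion(true): false means failed or killed.
    interface MapReduceJob {
        boolean waitForCompletion() throws Exception;
    }

    // Returns the process exit code for one workflow step: 0 on success,
    // non-zero on failure so that crawl scripts can detect it.
    static int runStep(MapReduceJob job, Runnable cleanup) {
        try {
            if (!job.waitForCompletion()) {
                cleanup.run(); // remove temporary data, unlock CrawlDb, etc.
                return 1;      // do not proceed with further jobs/steps
            }
            return 0;
        } catch (Exception e) {
            cleanup.run();
            return 1;
        }
    }
}
```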



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2383) Wrong FS exception in Fetcher

2017-11-08 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244019#comment-16244019
 ] 

Omkar Reddy commented on NUTCH-2383:


I recently faced this issue. We need to set the property 
mapreduce.framework.name in mapred-site.xml while configuring the cluster, as 
described at [0]. 

[0] 
https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html
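For reference, the relevant fragment of mapred-site.xml from the linked single-cluster guide (standard Hadoop configuration):

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```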

> Wrong FS exception in Fetcher
> -
>
> Key: NUTCH-2383
> URL: https://issues.apache.org/jira/browse/NUTCH-2383
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.13
> Environment: Hadoop 2.8 and Hadoop 2.7.2
>Reporter: Yossi Tamari
> Attachments: crawl output.txt
>
>
> Running bin/crawl on either Hadoop 2.7.2 or Hadoop 2.8, the Injector and 
> Generator succeed, but the Fetcher throws: 
> {code}java.lang.IllegalArgumentException: Wrong FS: 
> hdfs://localhost:9000/user/root/crawl/segments/20170430084337/crawl_fetch, 
> expected: file:///{code}.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb

2017-11-03 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238278#comment-16238278
 ] 

Omkar Reddy commented on NUTCH-2442:


[~wastl-nagel] I am working on this in my local branch for NUTCH-2375. Just so 
that I do not head in the wrong direction, I was thinking the fix should look 
like this:

boolean complete = job.waitForCompletion(true);
if (!complete) {
  // clean up: revert any significant changes made before or during the job
  throw new RuntimeException("Injector job failed.");
}

Please let me know if I need to add anything else, or if there is any 
discrepancy in what I am doing above. Thanks. 


> Injector to stop if job fails to avoid loss of CrawlDb
> --
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> Injector does not check whether the MapReduce job is successful. Even if the 
> job fails, Injector
> - installs the CrawlDb
> -- moves current/ to old/
> -- replaces current/ with an empty or potentially incomplete version
> - exits with code 0, so that scripts running the crawl workflow cannot detect 
> the failure -- if Injector is run a second time, the CrawlDb is lost (both 
> current/ and old/ are empty or corrupted)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2427) Remove all the Hadoop wildcard imports.

2017-09-20 Thread Omkar Reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Reddy updated NUTCH-2427:
---
Labels: easyfix  (was: )

> Remove all the Hadoop wildcard imports.
> ---
>
> Key: NUTCH-2427
> URL: https://issues.apache.org/jira/browse/NUTCH-2427
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Reporter: Omkar Reddy
>Priority: Minor
>  Labels: easyfix
>
> This improvement deals with removing the wildcard imports like "import 
> org.apache.hadoop.package.* "



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (NUTCH-2427) Remove all the Hadoop wildcard imports.

2017-09-20 Thread Omkar Reddy (JIRA)
Omkar Reddy created NUTCH-2427:
--

 Summary: Remove all the Hadoop wildcard imports.
 Key: NUTCH-2427
 URL: https://issues.apache.org/jira/browse/NUTCH-2427
 Project: Nutch
  Issue Type: Improvement
  Components: build
Reporter: Omkar Reddy
Priority: Minor


This improvement deals with removing the wildcard imports like "import 
org.apache.hadoop.package.* "



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce

2017-04-27 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986881#comment-15986881
 ] 

Omkar Reddy commented on NUTCH-2375:


Hello dev@,

I am using the following URL as a guide while upgrading the codebase: 
https://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api. Please 
post on this thread if there is any discrepancy in the slides at the above 
link.

Thanks,
Omkar.

> Upgrade the code base from org.apache.hadoop.mapred to 
> org.apache.hadoop.mapreduce
> --
>
> Key: NUTCH-2375
> URL: https://issues.apache.org/jira/browse/NUTCH-2375
> Project: Nutch
>  Issue Type: Improvement
>  Components: deployment
>Reporter: Omkar Reddy
>
> Nutch is still using the org.apache.hadoop.mapred API, which has been 
> deprecated. It needs to be updated to the org.apache.hadoop.mapreduce API. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce

2017-04-20 Thread Omkar Reddy (JIRA)
Omkar Reddy created NUTCH-2375:
--

 Summary: Upgrade the code base from org.apache.hadoop.mapred to 
org.apache.hadoop.mapreduce
 Key: NUTCH-2375
 URL: https://issues.apache.org/jira/browse/NUTCH-2375
 Project: Nutch
  Issue Type: Improvement
  Components: deployment
Reporter: Omkar Reddy


Nutch is still using the org.apache.hadoop.mapred API, which has been 
deprecated. It needs to be updated to the org.apache.hadoop.mapreduce API. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (NUTCH-2372) Javadocs build failing.

2017-04-10 Thread Omkar Reddy (JIRA)
Omkar Reddy created NUTCH-2372:
--

 Summary: Javadocs build failing.
 Key: NUTCH-2372
 URL: https://issues.apache.org/jira/browse/NUTCH-2372
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.13, 2.2.1
Reporter: Omkar Reddy
Priority: Minor


When we build the Nutch Javadocs using the command "ant javadoc", we get a 
handful of errors and the build fails. This is because up to JDK 7 the Javadoc 
tool was fairly lenient; JDK 8 added a new component to Javadoc called 
doclint, which changes that friendly behaviour: what used to be warnings are 
now errors. 

The error log can be found here: https://paste.apache.org/sVQ5




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph

2017-03-16 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929431#comment-15929431
 ] 

Omkar Reddy commented on NUTCH-2369:


Branch 1.x [~lewismc]. Thanks.

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
> --
>
> Key: NUTCH-2369
> URL: https://issues.apache.org/jira/browse/NUTCH-2369
> Project: Nutch
>  Issue Type: Task
>  Components: crawldb, graphgenerator, hostdb, linkdb, segment, 
> storage, tool
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>  Labels: gsoc2017
> Fix For: 1.14
>
>
> I've been thinking for quite some time now that a new Tool which writes Nutch 
> data out as full graph data would be an excellent addition to the codebase.
> My thoughts involve writing data using Tinkerpop's ScriptInputFormat and 
> ScriptOutputFormat to create Vertex objects representing Nutch Crawl 
> Records. 
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, the LinkDB, a 
> Segment, and possibly the HostDB in order to be fully populated. Graph 
> characteristics, e.g. Edges, would come from those existing data structures 
> as well.
> It is my intention to propose this as a GSoC project for 2017, and I have 
> already talked offline with a potential student [~omkar20895] about him 
> participating as the student.
> Essentially, if we were able to create a Graph enabling true traversal, this 
> could be a game changer for how Nutch Crawl data is interpreted. It is my 
> feeling that this issue most likely also involves an entire upgrade of the 
> Hadoop APIs from mapred to mapreduce for the master codebase.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-16 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929423#comment-15929423
 ] 

Omkar Reddy commented on NUTCH-2366:


Yes, this is my first patch [~lewismc]

> Deprecated Job constructor in hostdb/ReadHostDb.java
> 
>
> Key: NUTCH-2366
> URL: https://issues.apache.org/jira/browse/NUTCH-2366
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.12
>Reporter: Omkar Reddy
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2366.patch
>
>
> When we try to build Nutch using ant, we get the following warning: 
> warning: [deprecation] Job(Configuration,String) in Job has been deprecated
>[javac] Job job = new Job(conf, "ReadHostDb");
> This is because the constructor Job(Configuration conf, String jobName) has 
> been deprecated and the reference can be found at [0].
> [0] 
> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#getInstance%28org.apache.hadoop.conf.Configuration%29



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-15 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926746#comment-15926746
 ] 

Omkar Reddy commented on NUTCH-2366:


Thank you very much [~markus17]

> Deprecated Job constructor in hostdb/ReadHostDb.java
> 
>
> Key: NUTCH-2366
> URL: https://issues.apache.org/jira/browse/NUTCH-2366
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.12
>Reporter: Omkar Reddy
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2366.patch
>
>
> When we try to build Nutch using ant, we get the following warning: 
> warning: [deprecation] Job(Configuration,String) in Job has been deprecated
>[javac] Job job = new Job(conf, "ReadHostDb");
> This is because the constructor Job(Configuration conf, String jobName) has 
> been deprecated and the reference can be found at [0].
> [0] 
> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#getInstance%28org.apache.hadoop.conf.Configuration%29



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-11 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906265#comment-15906265
 ] 

Omkar Reddy commented on NUTCH-2366:


Hi [~markus17], Do I need to send a pull request to the git repo or is the 
patch enough? Thanks. 

> Deprecated Job constructor in hostdb/ReadHostDb.java
> 
>
> Key: NUTCH-2366
> URL: https://issues.apache.org/jira/browse/NUTCH-2366
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.12
>Reporter: Omkar Reddy
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2366.patch
>
>
> When we try to build Nutch using ant, we get the following warning: 
> warning: [deprecation] Job(Configuration,String) in Job has been deprecated
>[javac] Job job = new Job(conf, "ReadHostDb");
> This is because the constructor Job(Configuration conf, String jobName) has 
> been deprecated and the reference can be found at [0].
> [0] 
> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#getInstance%28org.apache.hadoop.conf.Configuration%29



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-10 Thread Omkar Reddy (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Omkar Reddy updated NUTCH-2366:
---
Attachment: NUTCH-2366.patch

> Deprecated Job constructor in hostdb/ReadHostDb.java
> 
>
> Key: NUTCH-2366
> URL: https://issues.apache.org/jira/browse/NUTCH-2366
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.2.1, 1.12
>Reporter: Omkar Reddy
>Priority: Minor
> Attachments: NUTCH-2366.patch
>
>
> When we try to build Nutch using ant, we get the following warning: 
> warning: [deprecation] Job(Configuration,String) in Job has been deprecated
>[javac] Job job = new Job(conf, "ReadHostDb");
> This is because the constructor Job(Configuration conf, String jobName) has 
> been deprecated and the reference can be found at [0].
> [0] 
> http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#getInstance%28org.apache.hadoop.conf.Configuration%29



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java

2017-03-10 Thread Omkar Reddy (JIRA)
Omkar Reddy created NUTCH-2366:
--

 Summary: Deprecated Job constructor in hostdb/ReadHostDb.java
 Key: NUTCH-2366
 URL: https://issues.apache.org/jira/browse/NUTCH-2366
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.12, 2.2.1
Reporter: Omkar Reddy
Priority: Minor


When we try to build Nutch using ant, we get the following warning: 
warning: [deprecation] Job(Configuration,String) in Job has been deprecated
   [javac] Job job = new Job(conf, "ReadHostDb");

This is because the constructor Job(Configuration conf, String jobName) has 
been deprecated and the reference can be found at [0].

[0] 
http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#getInstance%28org.apache.hadoop.conf.Configuration%29
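
For reference, a minimal sketch of the kind of change this implies (not the attached NUTCH-2366.patch itself): replace the deprecated constructor with the `Job.getInstance` factory method documented in the linked Hadoop Javadoc. The class and method names below are illustrative.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReadHostDbJobSketch {
  public static Job createJob(Configuration conf) throws IOException {
    // Deprecated since Hadoop 2.x (triggers the [deprecation] javac warning):
    //   Job job = new Job(conf, "ReadHostDb");
    // Preferred replacement, per the Hadoop mapreduce Javadoc:
    return Job.getInstance(conf, "ReadHostDb");
  }
}
```

The factory method also takes care of cloning the passed Configuration, so later changes to `conf` do not leak into the submitted job.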



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (NUTCH-2361) Deprecated nutch and solr integration documentation.

2017-02-21 Thread Omkar Reddy (JIRA)
Omkar Reddy created NUTCH-2361:
--

 Summary: Deprecated nutch and solr integration documentation.
 Key: NUTCH-2361
 URL: https://issues.apache.org/jira/browse/NUTCH-2361
 Project: Nutch
  Issue Type: Improvement
  Components: documentation
Reporter: Omkar Reddy
Priority: Trivial


I think the documentation here [0] is outdated and needs to be updated, as 
Solr's latest documentation [1] points out that Solr has both a managed schema 
and a classic schema, which can be used accordingly. 

[0] https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch
[1] 
https://cwiki.apache.org/confluence/display/solr/Schema+Factory+Definition+in+SolrConfig



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (NUTCH-2309) Scoring-Similarity Plugin raises NullPointerException when error occurs in fetching URL

2017-02-02 Thread Omkar Reddy (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850238#comment-15850238
 ] 

Omkar Reddy commented on NUTCH-2309:


Hi [~jxihong], I tried to reproduce this error but I was unable to do so. Can 
you please provide more insights regarding this? Thanks.

> Scoring-Similarity Plugin raises NullPointerException when error occurs in 
> fetching URL
> ---
>
> Key: NUTCH-2309
> URL: https://issues.apache.org/jira/browse/NUTCH-2309
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, scoring
>Affects Versions: 1.12
>Reporter: Joey Hong
>Priority: Trivial
>  Labels: easyfix
> Fix For: 1.13
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> When the Scoring-Similarity plugin is enabled, a NullPointerException is 
> thrown, cancelling the crawl, when computing the Cosine Similarity for URLs 
> where any kind of error occurred in fetching them. 
> The error occurs in line 77 in CosineSimilarity.java:
> float score = 
> Float.parseFloat(parseData.getContentMeta().get(Nutch.SCORE_KEY));
> This is probably because Nutch.SCORE_KEY is null for such URLs. It can be 
> easily fixed by setting a default value for score.
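
A possible shape of the null-guard fix, as a self-contained sketch: the map below stands in for `parseData.getContentMeta()`, the key string and the default of `0.0f` are assumptions for illustration, not necessarily what the plugin should use.

```java
import java.util.HashMap;
import java.util.Map;

public class ScoreDefaultSketch {
  /**
   * Returns the recorded score, or a default when the fetch failed and no
   * score metadata was written, avoiding the NPE from Float.parseFloat(null).
   */
  static float parseScore(Map<String, String> contentMeta, String scoreKey) {
    String raw = contentMeta.get(scoreKey); // null for failed fetches
    return (raw == null) ? 0.0f : Float.parseFloat(raw);
  }

  public static void main(String[] args) {
    Map<String, String> meta = new HashMap<>();
    meta.put("nutch.score", "1.5");
    System.out.println(parseScore(meta, "nutch.score"));            // prints 1.5
    System.out.println(parseScore(new HashMap<>(), "nutch.score")); // prints 0.0
  }
}
```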



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)