[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid
[ https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509469#comment-16509469 ] Omkar Reddy commented on NUTCH-2557:

A simple and wise solution. Thanks.

> protocol-http fails to follow redirections when an HTTP response body is invalid
>
> Key: NUTCH-2557
> URL: https://issues.apache.org/jira/browse/NUTCH-2557
> Project: Nutch
> Issue Type: Sub-task
> Affects Versions: 1.14
> Reporter: Gerard Bouchar
> Priority: Major
> Fix For: 1.15
>
> If a server sends a redirection (3XX status code, with a Location header), protocol-http tries to parse the HTTP response body anyway. Thus, if an error occurs while decoding the body, the redirection is not followed and the information is lost. Browsers follow the redirection and close the socket as soon as they can.
> * Example: this page is a redirection to its https version, with an HTTP body containing invalidly gzip-encoded contents. Browsers follow the redirection, but Nutch throws an error:
> ** [http://www.webarcelona.net/es/blog?page=2]
>
> The HttpResponse::getContent method can already return null. I think it should at least return null when parsing the HTTP response body fails.
> Ideally, we would adopt the same behavior as browsers, and not even try parsing the body when the headers indicate a redirection.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
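The last suggestion in the quoted report (do not even try parsing the body when the headers indicate a redirection) can be sketched as a tiny decision helper. This is an illustration only: shouldParseBody is a hypothetical name, not part of the Nutch protocol-http API.

```java
// Sketch of the proposed policy: on a 3XX response that carries a Location
// header, skip body parsing entirely and follow the redirect, as browsers do.
// shouldParseBody is a hypothetical helper, not actual Nutch code.
public class RedirectPolicy {

    /** Decide whether an HTTP response body is worth parsing at all. */
    public static boolean shouldParseBody(int status, String locationHeader) {
        boolean isRedirect = status >= 300 && status < 400 && locationHeader != null;
        // For redirects, the Location header carries all the information we
        // need; a broken (e.g. invalidly gzipped) body must not abort the fetch.
        return !isRedirect;
    }
}
```

With such a check in place, the invalid gzip body of the example URL above would simply never be decoded.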
[jira] [Commented] (NUTCH-2557) protocol-http fails to follow redirections when an HTTP response body is invalid
[ https://issues.apache.org/jira/browse/NUTCH-2557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16490581#comment-16490581 ] Omkar Reddy commented on NUTCH-2557:

I agree; sometimes the HTTP body of bad requests and redirects might contain diagnostic information that is helpful to the user, so we should store it optionally. Can we add the property as http.content.store.3XX.404? Or is that too complicated a name for a property?
[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses
[ https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16488900#comment-16488900 ] Omkar Reddy commented on NUTCH-2575:

I have taken up [NUTCH-2557|https://issues.apache.org/jira/browse/NUTCH-2557] and started working on it. Thanks.

> protocol-http does not respect the maximum content-size for chunked responses
>
> Key: NUTCH-2575
> URL: https://issues.apache.org/jira/browse/NUTCH-2575
> Project: Nutch
> Issue Type: Sub-task
> Components: protocol
> Affects Versions: 1.14
> Reporter: Gerard Bouchar
> Priority: Critical
> Fix For: 1.15
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from stopping to read content when it exceeds the maximum allowed size.
> There [is a variable contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404] that is used to check how much content has been read, but it is never updated, so it always stays zero, and [the size check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442] always returns false (unless a single chunk is larger than the maximum allowed content size).
> This allows any server to cause out-of-memory errors on our side.
[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size
[ https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16465037#comment-16465037 ] Omkar Reddy commented on NUTCH-2575:

Hi [~gbouchar], I see the issue: while reading every chunk we calculate the number of bytes read in the chunk in the variable "chunkBytesRead", but it is never added to "contentBytesRead" after reading the chunk. A simple solution is to do "contentBytesRead += chunkBytesRead" at the end of every chunk. This should fix it. I will send a PR for this. Thanks.
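The accumulation fix described above can be illustrated with a self-contained sketch of a chunked-transfer reader. Only the variable names contentBytesRead and chunkBytesRead are taken from the comment; the rest is a simplified illustration, not the actual HttpResponse code (it skips trailers and chunk extensions beyond dropping them).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Simplified sketch of reading HTTP chunked transfer encoding with a size cap,
// showing the missing "contentBytesRead += chunkBytesRead" accumulation.
public class ChunkedReader {

    // Read one CRLF-terminated line (size line or trailing CRLF).
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1 && c != '\n') {
            if (c != '\r') sb.append((char) c);
        }
        return sb.toString();
    }

    static byte[] readChunkedContent(InputStream in, int maxContent) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int contentBytesRead = 0;                  // total bytes across all chunks
        while (true) {
            String sizeLine = readLine(in).trim();
            int semi = sizeLine.indexOf(';');      // drop any chunk extensions
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi);
            int chunkLen = Integer.parseInt(sizeLine.trim(), 16);
            if (chunkLen == 0) break;              // last chunk
            byte[] buf = new byte[chunkLen];
            int chunkBytesRead = 0;
            while (chunkBytesRead < chunkLen) {
                int n = in.read(buf, chunkBytesRead, chunkLen - chunkBytesRead);
                if (n == -1) throw new IOException("premature EOF in chunk");
                chunkBytesRead += n;
            }
            out.write(buf, 0, chunkBytesRead);
            // The fix: accumulate, so the size cap applies across chunks,
            // not only within a single chunk.
            contentBytesRead += chunkBytesRead;
            if (contentBytesRead >= maxContent) break;  // stop: cap reached
            readLine(in);                          // consume CRLF after chunk data
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String body = "5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n";
        byte[] all = readChunkedContent(new ByteArrayInputStream(body.getBytes()), 1000);
        System.out.println(new String(all)); // hello world
        byte[] capped = readChunkedContent(new ByteArrayInputStream(body.getBytes()), 5);
        System.out.println(capped.length);   // 5: reading stopped after the first chunk
    }
}
```

Without the accumulation line, contentBytesRead would stay zero forever and the cap check would never fire, which is exactly the reported bug.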
[jira] [Commented] (NUTCH-2553) Fetcher not to modify URLs to be fetched
[ https://issues.apache.org/jira/browse/NUTCH-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439219#comment-16439219 ] Omkar Reddy commented on NUTCH-2553:

WOW! :O I couldn't have figured that one out on my own. Thanks [~wastl-nagel]

> Fetcher not to modify URLs to be fetched
>
> Key: NUTCH-2553
> URL: https://issues.apache.org/jira/browse/NUTCH-2553
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.15
>
> Fetcher modifies the URLs being fetched (introduced with NUTCH-2375 in [c93d908|https://github.com/apache/nutch/commit/c93d908bb635d3c5b59f8c8a22e0584ebf588794#diff-847479d08597eb30da1c715310438685R253]):
> {noformat}
> FetcherThread 22 fetching http://nutch.apache.org:-1/ (queue crawl delay=5000ms)
> {noformat}
> which makes it hard to trace the URLs in the log files and likely causes other issues because URLs in CrawlDb and segments do not match (http://nutch.apache.org/ in CrawlDb and http://nutch.apache.org:-1/ in segment).
[jira] [Commented] (NUTCH-2551) NullPointerException in generator
[ https://issues.apache.org/jira/browse/NUTCH-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16439208#comment-16439208 ] Omkar Reddy commented on NUTCH-2551:

Hello [~wastl-nagel], I used Hadoop 2.7.4; I will try to reproduce it using Hadoop 2.8.3 just as an exercise. Thanks for letting me know. I used 2.7.4 because the Ivy configuration at [nutch|https://github.com/apache/nutch]/[ivy|https://github.com/apache/nutch/tree/master/ivy]/*ivy.xml* uses that version. [~HansBrende] thank you for the explanation and the patch. :)

> NullPointerException in generator
>
> Key: NUTCH-2551
> URL: https://issues.apache.org/jira/browse/NUTCH-2551
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 1.15
> Reporter: Hans Brende
> Assignee: Sebastian Nagel
> Priority: Blocker
> Fix For: 1.15
>
> A NullPointerException is thrown during the crawl generate stage when I deploy to a Hadoop cluster (but for some reason, it works fine locally).
> It looks like this is caused because the URLPartitioner class still has the old {{configure()}} method in there (which is never called, causing the {{normalizers}} field to remain null), rather than implementing the {{Configurable}} interface as detailed in the newer mapreduce API's Partitioner spec.
> Stack trace:
> {code}
> java.lang.NullPointerException
> at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:76)
> at org.apache.nutch.crawl.URLPartitioner.getPartition(URLPartitioner.java:40)
> at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:716)
> at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
> at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
> at org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:553)
> at org.apache.nutch.crawl.Generator$SelectorInverseMapper.map(Generator.java:546)
> at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
> at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
> at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:175)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
> at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:169)
> {code}
>
> Oh and it might also be because a *static* URLPartitioner instance is being used in the Generator.Selector class... but it's only initialized in the {{setup()}} method of the Generator.Selector.SelectorMapper class! So that whole setup looks pretty weird...
[jira] [Commented] (NUTCH-2551) NullPointerException in generator
[ https://issues.apache.org/jira/browse/NUTCH-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16433705#comment-16433705 ] Omkar Reddy commented on NUTCH-2551:

[~wastl-nagel], [~HansBrende], [~lewi...@apache.org] please let me know your thoughts on my explanation above so that I can send a PR with the fix. Thanks.
[jira] [Commented] (NUTCH-2551) NullPointerException in generator
[ https://issues.apache.org/jira/browse/NUTCH-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431997#comment-16431997 ] Omkar Reddy commented on NUTCH-2551:

Hi [~HansBrende], I tried reproducing the error in pseudo-distributed mode on my local machine and was unable to. Are there any specific steps to reproduce, or just a general crawl cycle? Thanks.
[jira] [Commented] (NUTCH-2553) Fetcher not to modify URLs to be fetched
[ https://issues.apache.org/jira/browse/NUTCH-2553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431801#comment-16431801 ] Omkar Reddy commented on NUTCH-2553:

[~wastl-nagel] I did not add anything that produces this specific change; it might be a result of some implementation change I made in NUTCH-2375. I will find the root cause and let you know. Thanks.
[jira] [Commented] (NUTCH-2551) NullPointerException in generator
[ https://issues.apache.org/jira/browse/NUTCH-2551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431798#comment-16431798 ] Omkar Reddy commented on NUTCH-2551:

I think the issue here is that a new job (via job.getInstance) is being created in the setup() of GeneratorSelectorMapper, and that job is passed when we configure the partitioner. This might be why the configuration is lost, and hence the NullPointerException. I don't know why I created a new job in that patch (NUTCH-2375) rather than just passing the configuration object to the URLPartitioner.configure() method; my bad. This is a quick fix and I will send a PR. Thanks.
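Based on the comment above, the fix presumably amounts to configuring the partitioner from the live Configuration instead of a freshly created job whose configuration is empty. Schematically (names are illustrative; only URLPartitioner.configure() is taken from the comment, and the actual patch may differ):

```
// before: a new job created in setup() carries a fresh, empty configuration,
// so the normalizers/partition settings are lost
Job job = Job.getInstance();
urlPartitioner.configure(job);

// after: pass the Configuration that actually holds the crawl settings
urlPartitioner.configure(conf);
```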
[jira] [Commented] (NUTCH-2518) Must check return value of job.waitForCompletion()
[ https://issues.apache.org/jira/browse/NUTCH-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415426#comment-16415426 ] Omkar Reddy commented on NUTCH-2518:

I might have missed this ticket. Hi [~wastl-nagel], this was not covered in my PR for [NUTCH-2375|https://github.com/apache/nutch/pull/221]. [~kpm1985], [~wastl-nagel], [~lewismc] I see there is a PR with just a minor change for this issue. I can take it up if that is not a problem. Please let me know either way. Thanks.

> Must check return value of job.waitForCompletion()
>
> Key: NUTCH-2518
> URL: https://issues.apache.org/jira/browse/NUTCH-2518
> Project: Nutch
> Issue Type: Bug
> Components: crawldb, fetcher, generator, hostdb, linkdb
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Assignee: Kenneth McFarland
> Priority: Blocker
> Fix For: 1.15
>
> The return value of job.waitForCompletion() of the new MapReduce API (NUTCH-2375) must always be checked. If it is not true, the job has failed or been killed. Accordingly, the program
> - should not proceed with further jobs/steps
> - must clean up temporary data, unlock CrawlDb, etc.
> - must exit with a non-zero exit value, so that scripts running the crawl workflow can handle the failure
> Cf. NUTCH-2076, NUTCH-2442, [NUTCH-2375 PR #221|https://github.com/apache/nutch/pull/221#issuecomment-332941883].
[jira] [Commented] (NUTCH-2383) Wrong FS exception in Fetcher
[ https://issues.apache.org/jira/browse/NUTCH-2383?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244019#comment-16244019 ] Omkar Reddy commented on NUTCH-2383:

I recently faced this issue; we need to set the property mapreduce.framework.name in mapred-site.xml while configuring the cluster, as mentioned at [0].

[0] https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/SingleCluster.html

> Wrong FS exception in Fetcher
>
> Key: NUTCH-2383
> URL: https://issues.apache.org/jira/browse/NUTCH-2383
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.13
> Environment: Hadoop 2.8 and Hadoop 2.7.2
> Reporter: Yossi Tamari
> Attachments: crawl output.txt
>
> Running bin/crawl on either Hadoop 2.7.2 or Hadoop 2.8, the Injector and Generator succeed, but the Fetcher throws:
> {code}
> java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:9000/user/root/crawl/segments/20170430084337/crawl_fetch, expected: file:///
> {code}

-- This message was sent by Atlassian JIRA (v6.4.14#64029)
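For reference, the property named in the comment is set in mapred-site.xml; a minimal fragment (assuming a YARN-based cluster, as in the linked single-cluster guide) looks like:

```
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```

Without it, jobs run with the local job runner and resolve paths against file:///, which matches the "Wrong FS" exception above.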
[jira] [Commented] (NUTCH-2442) Injector to stop if job fails to avoid loss of CrawlDb
[ https://issues.apache.org/jira/browse/NUTCH-2442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16238278#comment-16238278 ] Omkar Reddy commented on NUTCH-2442:

[~wastl-nagel] I am working on this on my local branch of NUTCH-2375. Just so that I do not head in the wrong direction, I was thinking the fix should look like the following:

{code}
boolean complete = job.waitForCompletion(true);
if (!complete) {
  // clean-up statements to revert any significant changes
  // that happened during or before the job
  throw new RuntimeException("Injector job failed");
}
{code}

Please let me know if I need to add anything else or if there is any discrepancy in what I am doing above. Thanks.

> Injector to stop if job fails to avoid loss of CrawlDb
>
> Key: NUTCH-2442
> URL: https://issues.apache.org/jira/browse/NUTCH-2442
> Project: Nutch
> Issue Type: Bug
> Components: injector
> Affects Versions: 1.13
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.14
>
> Injector does not check whether the MapReduce job is successful. Even if the job fails, it
> - installs the CrawlDb
> -- moves current/ to old/
> -- replaces current/ with an empty or potentially incomplete version
> - exits with code 0, so that scripts running the crawl workflow cannot detect the failure -- if Injector is run a second time, the CrawlDb is lost (both current/ and old/ are empty or corrupted)
[jira] [Updated] (NUTCH-2427) Remove all the Hadoop wildcard imports.
[ https://issues.apache.org/jira/browse/NUTCH-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Omkar Reddy updated NUTCH-2427:

Labels: easyfix (was: )

> Remove all the Hadoop wildcard imports.
>
> Key: NUTCH-2427
> URL: https://issues.apache.org/jira/browse/NUTCH-2427
> Project: Nutch
> Issue Type: Improvement
> Components: build
> Reporter: Omkar Reddy
> Priority: Minor
> Labels: easyfix
>
> This improvement deals with removing wildcard imports like "import org.apache.hadoop.package.*".
[jira] [Created] (NUTCH-2427) Remove all the Hadoop wildcard imports.
Omkar Reddy created NUTCH-2427:

Summary: Remove all the Hadoop wildcard imports.
Key: NUTCH-2427
URL: https://issues.apache.org/jira/browse/NUTCH-2427
Project: Nutch
Issue Type: Improvement
Components: build
Reporter: Omkar Reddy
Priority: Minor

This improvement deals with removing wildcard imports like "import org.apache.hadoop.package.*".
[jira] [Commented] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
[ https://issues.apache.org/jira/browse/NUTCH-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15986881#comment-15986881 ] Omkar Reddy commented on NUTCH-2375:

Hello dev@, I am using the following URL to upgrade the codebase: https://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api. Please post on this thread if there is any discrepancy in the slides at the above link. Thanks, Omkar.

> Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
>
> Key: NUTCH-2375
> URL: https://issues.apache.org/jira/browse/NUTCH-2375
> Project: Nutch
> Issue Type: Improvement
> Components: deployment
> Reporter: Omkar Reddy
>
> Nutch is still using the deprecated org.apache.hadoop.mapred API. It needs to be updated to the org.apache.hadoop.mapreduce API.

-- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (NUTCH-2375) Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
Omkar Reddy created NUTCH-2375:

Summary: Upgrade the code base from org.apache.hadoop.mapred to org.apache.hadoop.mapreduce
Key: NUTCH-2375
URL: https://issues.apache.org/jira/browse/NUTCH-2375
Project: Nutch
Issue Type: Improvement
Components: deployment
Reporter: Omkar Reddy

Nutch is still using the deprecated org.apache.hadoop.mapred API. It needs to be updated to the org.apache.hadoop.mapreduce API.
[jira] [Created] (NUTCH-2372) Javadocs build failing.
Omkar Reddy created NUTCH-2372:

Summary: Javadocs build failing.
Key: NUTCH-2372
URL: https://issues.apache.org/jira/browse/NUTCH-2372
Project: Nutch
Issue Type: Bug
Components: documentation
Affects Versions: 1.13, 2.2.1
Reporter: Omkar Reddy
Priority: Minor

When we build the javadocs of Nutch using the command "ant javadoc" we get a handful of errors and the build fails. This is because, up to JDK 7, the Javadoc tool was pretty lenient; JDK 8 added a new part to Javadoc called doclint, which changes that friendly behaviour: warnings turned into errors. The error log can be found here: https://paste.apache.org/sVQ5
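One common workaround for doclint failures under JDK 8 (an assumption on my part, not necessarily the fix adopted for this issue) is to disable doclint via the Ant javadoc task's additionalparam attribute:

```
<javadoc destdir="build/docs/api"
         additionalparam="-Xdoclint:none">
  <fileset dir="src/java"/>
</javadoc>
```

The alternative is to fix the offending Javadoc comments so the stricter checks pass.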
[jira] [Commented] (NUTCH-2369) Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
[ https://issues.apache.org/jira/browse/NUTCH-2369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929431#comment-15929431 ] Omkar Reddy commented on NUTCH-2369:

Branch 1.x [~lewismc]. Thanks.

> Create a new GraphGenerator Tool for writing Nutch Records as a Full Web Graph
>
> Key: NUTCH-2369
> URL: https://issues.apache.org/jira/browse/NUTCH-2369
> Project: Nutch
> Issue Type: Task
> Components: crawldb, graphgenerator, hostdb, linkdb, segment, storage, tool
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Labels: gsoc2017
> Fix For: 1.14
>
> I've been thinking for quite some time now that a new Tool which writes Nutch data out as full graph data would be an excellent addition to the codebase. My thoughts involve writing data using Tinkerpop's ScriptInputFormat and ScriptOutputFormat to create Vertex objects representing Nutch Crawl Records.
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptInputFormat.html
> http://tinkerpop.apache.org/javadocs/current/full/index.html?org/apache/tinkerpop/gremlin/hadoop/structure/io/script/ScriptOutputFormat.html
> I envisage that each Vertex object would require the CrawlDB, LinkDB, a Segment and possibly the HostDB in order to be fully populated. Graph characteristics, e.g. Edges, would come from those existing data structures as well.
> It is my intention to propose this as a GSoC project for 2017 and I have already talked offline with a potential student [~omkar20895] about him participating as the student.
> Essentially, if we were able to create a Graph enabling true traversal, this could be a game changer for how Nutch Crawl data is interpreted. It is my feeling that this issue most likely also involves an entire upgrade of the Hadoop APIs from mapred to mapreduce for the master codebase.
[jira] [Commented] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java
[ https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15929423#comment-15929423 ] Omkar Reddy commented on NUTCH-2366:

Yes, this is my first patch [~lewismc]

> Deprecated Job constructor in hostdb/ReadHostDb.java
>
> Key: NUTCH-2366
> URL: https://issues.apache.org/jira/browse/NUTCH-2366
> Project: Nutch
> Issue Type: Bug
> Components: build
> Affects Versions: 1.12
> Reporter: Omkar Reddy
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.13
> Attachments: NUTCH-2366.patch
>
> When we build Nutch with Ant we get the following warning:
> {code}
> warning: [deprecation] Job(Configuration,String) in Job has been deprecated
> [javac] Job job = new Job(conf, "ReadHostDb");
> {code}
> This is because the constructor Job(Configuration conf, String jobName) has been deprecated; the reference can be found at [0].
> [0] http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#getInstance%28org.apache.hadoop.conf.Configuration%29
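The attached patch presumably replaces the deprecated constructor with the static factory method referenced at [0]. Schematically (the exact patch content is not shown in this thread):

```
- Job job = new Job(conf, "ReadHostDb");
+ Job job = Job.getInstance(conf, "ReadHostDb");
```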
[jira] [Commented] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java
[ https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926746#comment-15926746 ] Omkar Reddy commented on NUTCH-2366:

Thank you very much [~markus17]
[jira] [Commented] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java
[ https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15906265#comment-15906265 ]

Omkar Reddy commented on NUTCH-2366:

Hi [~markus17], do I need to send a pull request to the git repo, or is the patch enough? Thanks.
[jira] [Updated] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java
[ https://issues.apache.org/jira/browse/NUTCH-2366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Omkar Reddy updated NUTCH-2366:

Attachment: NUTCH-2366.patch
[jira] [Created] (NUTCH-2366) Deprecated Job constructor in hostdb/ReadHostDb.java
Omkar Reddy created NUTCH-2366:

Summary: Deprecated Job constructor in hostdb/ReadHostDb.java
Key: NUTCH-2366
URL: https://issues.apache.org/jira/browse/NUTCH-2366
Project: Nutch
Issue Type: Bug
Components: build
Affects Versions: 1.12, 2.2.1
Reporter: Omkar Reddy
Priority: Minor

When we try to build Nutch using ant, we get the following warning:

    warning: [deprecation] Job(Configuration,String) in Job has been deprecated
    [javac] Job job = new Job(conf, "ReadHostDb");

This is because the constructor Job(Configuration conf, String jobName) has been deprecated; the reference can be found at [0].

[0] http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html#getInstance%28org.apache.hadoop.conf.Configuration%29
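For context, the Hadoop API page linked above names the static factory method Job.getInstance as the replacement for the deprecated constructor. A minimal sketch of the change (surrounding ReadHostDb code assumed, not shown):

```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Before (deprecated, triggers the [deprecation] javac warning):
// Job job = new Job(conf, "ReadHostDb");

// After: same Configuration and job name, via the factory method.
Job job = Job.getInstance(conf, "ReadHostDb");
```

Job.getInstance copies the passed-in Configuration rather than holding a reference, so later changes to conf do not affect the job; otherwise the two forms behave the same.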
[jira] [Created] (NUTCH-2361) Deprecated nutch and solr integration documentation.
Omkar Reddy created NUTCH-2361:

Summary: Deprecated nutch and solr integration documentation.
Key: NUTCH-2361
URL: https://issues.apache.org/jira/browse/NUTCH-2361
Project: Nutch
Issue Type: Improvement
Components: documentation
Reporter: Omkar Reddy
Priority: Trivial

I think the documentation at [0] is outdated and needs to be updated: Solr's latest documentation [1] points out that Solr has both a managed schema and the classic schema, and either can be used accordingly.

[0] https://wiki.apache.org/nutch/NutchTutorial#Integrate_Solr_with_Nutch
[1] https://cwiki.apache.org/confluence/display/solr/Schema+Factory+Definition+in+SolrConfig
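For reference, the two schema modes mentioned above are selected via the schemaFactory element in solrconfig.xml. A sketch based on the Solr reference guide linked at [1] (resource names are Solr's defaults):

```
<!-- solrconfig.xml: managed schema (Solr's default in recent versions);
     the schema is modified through the Schema API rather than by hand. -->
<schemaFactory class="ManagedIndexSchemaFactory">
  <bool name="mutable">true</bool>
  <str name="managedSchemaResourceName">managed-schema</str>
</schemaFactory>

<!-- Classic mode: Solr reads a hand-edited schema.xml instead. -->
<!--
<schemaFactory class="ClassicIndexSchemaFactory"/>
-->
```

A Nutch tutorial update would need to say which of the two modes its schema.xml instructions assume.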
[jira] [Commented] (NUTCH-2309) Scoring-Similarity Plugin raises NullPointerException when error occurs in fetching URL
[ https://issues.apache.org/jira/browse/NUTCH-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15850238#comment-15850238 ]

Omkar Reddy commented on NUTCH-2309:

Hi [~jxihong], I tried to reproduce this error but I was unable to do so. Can you please provide more insights regarding this? Thanks.

> Scoring-Similarity Plugin raises NullPointerException when error occurs in fetching URL
>
> Key: NUTCH-2309
> URL: https://issues.apache.org/jira/browse/NUTCH-2309
> Project: Nutch
> Issue Type: Bug
> Components: plugin, scoring
> Affects Versions: 1.12
> Reporter: Joey Hong
> Priority: Trivial
> Labels: easyfix
> Fix For: 1.13
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> When the Scoring-Similarity plugin is enabled and any kind of error occurred
> while fetching a URL, a NullPointerException is thrown while computing the
> cosine similarity for that URL, cancelling the crawl.
>
> The error occurs on line 77 of CosineSimilarity.java:
>
>     float score = Float.parseFloat(parseData.getContentMeta().get(Nutch.SCORE_KEY));
>
> This is probably because the metadata value for Nutch.SCORE_KEY is null for
> such URLs. It can be easily fixed by setting a default value for the score.
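The default-value fix suggested in the report can be sketched as a null-safe parse. This is a standalone illustration of the pattern, not the actual Nutch patch; the helper name and the fallback value are hypothetical:

```java
// Hypothetical helper mirroring the fix suggested for CosineSimilarity.java:
// fall back to a default score when the metadata value is missing or malformed,
// instead of letting Float.parseFloat(null) throw a NullPointerException.
public class ScoreParser {
    static float parseScoreOrDefault(String raw, float fallback) {
        if (raw == null) {
            return fallback; // e.g. Nutch.SCORE_KEY metadata absent after a fetch error
        }
        try {
            return Float.parseFloat(raw);
        } catch (NumberFormatException e) {
            return fallback; // metadata present but not a valid float
        }
    }

    public static void main(String[] args) {
        System.out.println(parseScoreOrDefault(null, 1.0f));  // missing metadata -> 1.0
        System.out.println(parseScoreOrDefault("0.5", 1.0f)); // normal case -> 0.5
    }
}
```

With this in place, a URL whose fetch failed simply contributes the default score rather than aborting the whole crawl.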