[jira] [Commented] (NUTCH-1084) ReadDB url throws exception
[ https://issues.apache.org/jira/browse/NUTCH-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741933#comment-14741933 ]

Nadeem Douba commented on NUTCH-1084:
-------------------------------------

I think I found the issue, and I don't think it's related to Nutch. AbstractMapWritable uses the Class.forName method, which throws the CNFE. This is because Class.forName uses the system class loader, which is different from the current thread's context class loader in that it does not include the job jar on its class path. I recompiled hadoop-common to see if it would fix the issue, replacing the Class.forName call with Thread.currentThread().getContextClassLoader().loadClass(class). This seems to fix the issue.

> ReadDB url throws exception
> ---------------------------
>
>                 Key: NUTCH-1084
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1084
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>         Attachments: NUTCH-1084.patch
>
> Readdb -url suffers from two problems:
> 1. it trips over the _SUCCESS file generated by newer Hadoop versions
> 2. it throws "can't find class: org.apache.nutch.protocol.ProtocolStatus" (???)
> The first problem can be remedied by not allowing the injector or updater to
> write the _SUCCESS file. Until now that's the solution implemented for
> similar issues. I've not been successful in making the Hadoop readers simply
> skip the file.
> The second issue seems a bit strange and did not happen on a local checkout.
> I'm not yet sure whether this is a Hadoop issue or something being corrupt in
> the CrawlDB.
> Here's the stack trace:
> {code}
> Exception in thread "main" java.io.IOException: can't find class:
> org.apache.nutch.protocol.ProtocolStatus because
> org.apache.nutch.protocol.ProtocolStatus
> 	at org.apache.hadoop.io.AbstractMapWritable.readFields(AbstractMapWritable.java:204)
> 	at org.apache.hadoop.io.MapWritable.readFields(MapWritable.java:146)
> 	at org.apache.nutch.crawl.CrawlDatum.readFields(CrawlDatum.java:278)
> 	at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1751)
> 	at org.apache.hadoop.io.MapFile$Reader.get(MapFile.java:524)
> 	at org.apache.hadoop.mapred.MapFileOutputFormat.getEntry(MapFileOutputFormat.java:105)
> 	at org.apache.nutch.crawl.CrawlDbReader.get(CrawlDbReader.java:383)
> 	at org.apache.nutch.crawl.CrawlDbReader.readUrl(CrawlDbReader.java:389)
> 	at org.apache.nutch.crawl.CrawlDbReader.main(CrawlDbReader.java:514)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> 	at java.lang.reflect.Method.invoke(Method.java:597)
> 	at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
> {code}

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
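The class-loader difference described in the comment above can be illustrated with a small, self-contained sketch. This is plain JDK code, not the actual hadoop-common patch; the class name is a stand-in, since org.apache.nutch.protocol.ProtocolStatus lives in the job jar and is not available here:

```java
// Sketch: Class.forName resolves against the caller's defining class loader,
// while Thread.currentThread().getContextClassLoader() can be set (as Hadoop
// does for tasks) to a loader that also sees the job jar. In a plain JVM the
// two loaders coincide, so both calls resolve to the same Class object here;
// inside a Hadoop task they may not, which is what caused the CNFE.
public class LoaderSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical stand-in name; in the Nutch case it was
        // org.apache.nutch.protocol.ProtocolStatus from the job jar.
        String name = "java.util.ArrayList";

        // What AbstractMapWritable effectively did:
        Class<?> viaForName = Class.forName(name);

        // The replacement proposed in the comment:
        ClassLoader ctx = Thread.currentThread().getContextClassLoader();
        Class<?> viaContext = ctx.loadClass(name);

        // In this plain-JVM sketch both resolve identically.
        System.out.println(viaForName == viaContext);
    }
}
```

The point of the fix is that the context class loader is the one Hadoop configures with the job jar on its class path, so classes shipped in the job jar remain resolvable.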
[jira] [Commented] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741901#comment-14741901 ]

Chris A. Mattmann commented on NUTCH-2094:
------------------------------------------

No problem, just switch to branch-2.3 (it should be a branch in the GitHub repo).

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>            Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2096) Explicitly indicate browser binary to use when selecting selenium remote option in config
[ https://issues.apache.org/jira/browse/NUTCH-2096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741693#comment-14741693 ]

Kim Whitehall commented on NUTCH-2096:
--------------------------------------

I can't seem to figure out how to assign this task to myself. Anyhow, I'm working on it.

> Explicitly indicate browser binary to use when selecting selenium remote
> option in config
> -------------------------------------------------------------------------
>
>                 Key: NUTCH-2096
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2096
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin
>            Reporter: Kim Whitehall
>            Priority: Minor
>             Fix For: 1.11
>
> When using the selenium grid, not defining the binary version on nodes that
> have multiple versions of browsers can lead to errors.
> The solution proposed is to extend the DesiredCapabilities capabilities
> provided in the "remote" case of
> $NUTCH_HOME/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
> provided in NUTCH-2083 to explicitly indicate the browser path.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2096) Explicitly indicate browser binary to use when selecting selenium remote option in config
Kim Whitehall created NUTCH-2096:
------------------------------------

             Summary: Explicitly indicate browser binary to use when selecting selenium remote option in config
                 Key: NUTCH-2096
                 URL: https://issues.apache.org/jira/browse/NUTCH-2096
             Project: Nutch
          Issue Type: Improvement
          Components: plugin
            Reporter: Kim Whitehall
            Priority: Minor
             Fix For: 1.11

When using the selenium grid, not defining the binary version on nodes that have multiple versions of browsers can lead to errors.

The solution proposed is to extend the DesiredCapabilities capabilities provided in the "remote" case of $NUTCH_HOME/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java provided in NUTCH-2083 to explicitly indicate the browser path.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
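A Selenium DesiredCapabilities object is, at heart, a map of key/value settings that the client sends to the grid node. A minimal plain-JDK sketch of the intent (not the actual HttpWebClient change): the "firefox_binary" key follows Selenium 2.x conventions for pointing a node at a specific Firefox executable, and the path below is a made-up example.

```java
import java.util.HashMap;
import java.util.Map;

public class RemoteBinarySketch {
    public static void main(String[] args) {
        // Hypothetical capability map; in real code this would be a
        // DesiredCapabilities instance passed to RemoteWebDriver.
        Map<String, Object> capabilities = new HashMap<>();
        capabilities.put("browserName", "firefox");
        // Assumed key name and example path: tells the node which of several
        // installed browser binaries to launch, which is the gap NUTCH-2096
        // proposes to close in the "remote" case.
        capabilities.put("firefox_binary", "/opt/firefox-38/firefox");
        System.out.println(capabilities.get("firefox_binary"));
    }
}
```

With the binary path pinned in the capabilities, a node hosting multiple browser versions no longer has to guess which executable to start.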
[jira] [Updated] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper
[ https://issues.apache.org/jira/browse/NUTCH-2095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jorge Luis Betancourt Gonzalez updated NUTCH-2095:
--------------------------------------------------
    Attachment: NUTCH-2095.patch

> WARC exporter for the CommonCrawlDataDumper
> -------------------------------------------
>
>                 Key: NUTCH-2095
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2095
>             Project: Nutch
>          Issue Type: Improvement
>          Components: commoncrawl, tool
>    Affects Versions: 1.11
>            Reporter: Jorge Luis Betancourt Gonzalez
>            Priority: Minor
>              Labels: tools, warc
>         Attachments: NUTCH-2095.patch
>
> Adds the possibility of exporting the Nutch segments to WARC files.
> From the usage point of view, a couple of new command line options are
> available:
> {{-warc}}: enables exporting into WARC files; if not specified, the default
> JACKSON formatter is used.
> {{-warcSize}}: defines a maximum file size for each WARC file; if not
> specified, a default of 1 GB per file is used, as recommended by the WARC
> ISO standard.
> The usual {{-gzip}} flag can be used to enable compression on the WARC files.
> Some changes to the default {{CommonCrawlDataDumper}} were done, essentially
> to the Factory and to the Formats. These changes avoid creating a new
> instance of a {{CommonCrawlFormat}} for each URL read from the segments.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (NUTCH-2095) WARC exporter for the CommonCrawlDataDumper
Jorge Luis Betancourt Gonzalez created NUTCH-2095:
-----------------------------------------------------

             Summary: WARC exporter for the CommonCrawlDataDumper
                 Key: NUTCH-2095
                 URL: https://issues.apache.org/jira/browse/NUTCH-2095
             Project: Nutch
          Issue Type: Improvement
          Components: commoncrawl, tool
    Affects Versions: 1.11
            Reporter: Jorge Luis Betancourt Gonzalez
            Priority: Minor

Adds the possibility of exporting the Nutch segments to WARC files.

From the usage point of view, a couple of new command line options are available:

{{-warc}}: enables exporting into WARC files; if not specified, the default JACKSON formatter is used.

{{-warcSize}}: defines a maximum file size for each WARC file; if not specified, a default of 1 GB per file is used, as recommended by the WARC ISO standard.

The usual {{-gzip}} flag can be used to enable compression on the WARC files.

Some changes to the default {{CommonCrawlDataDumper}} were done, essentially to the Factory and to the Formats. These changes avoid creating a new instance of a {{CommonCrawlFormat}} for each URL read from the segments.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
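The factory change described above, reusing one format instance across all URLs and closing it once, can be sketched with hypothetical stand-in types (the real interface is CommonCrawlFormat; the names below are illustrative, not Nutch's actual API):

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class FormatReuseSketch {
    // Hypothetical stand-in for CommonCrawlFormat, with the close() method the
    // issue describes for formats that need a closing step (e.g. a WARC writer
    // finishing its current file).
    interface Format extends Closeable {
        void write(String url);
    }

    static class CountingFormat implements Format {
        int written = 0;
        public void write(String url) { written++; }
        public void close() { /* flush and finish output here */ }
    }

    public static void main(String[] args) throws IOException {
        List<String> urls = Arrays.asList("http://a/", "http://b/", "http://c/");

        // One instance reused for every URL and closed once at the end,
        // instead of constructing a new format per record read from the
        // segments, which is the overhead the refactor removes.
        CountingFormat format = new CountingFormat();
        for (String url : urls) {
            format.write(url);
        }
        format.close();
        System.out.println(format.written);
    }
}
```

Reuse also matters for the {{-warcSize}} option: only a long-lived writer can track how many bytes it has emitted and rotate files at the size limit.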
[jira] [Commented] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741384#comment-14741384 ]

Prerna Satija commented on NUTCH-2094:
--------------------------------------

Hi [~chrismattmann], I opened the git link that you shared, but the repository at the clone link is for Nutch 1.0, while my fix is for a bug in Nutch 2.3. Can you send the clone link for Nutch 2.3?

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>            Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14741292#comment-14741292 ]

Chris A. Mattmann commented on NUTCH-2094:
------------------------------------------

Hi [~prernasatija], would you be willing to submit a Pull Request/Patch for this issue per http://github.com/apache/nutch/#contributing ? I would be happy to commit it.

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>            Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on NUTCH-2094 started by Chris A. Mattmann.
------------------------------------------------

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>            Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Reopened] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris A. Mattmann reopened NUTCH-2094:
--------------------------------------
    Assignee: Chris A. Mattmann

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>            Assignee: Chris A. Mattmann
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
[ https://issues.apache.org/jira/browse/NUTCH-2094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Prerna Satija resolved NUTCH-2094.
----------------------------------
    Resolution: Fixed

I fixed this issue on line 57 of NutchServerPoolExecutor.java. Instead of the line

    runningWorkers.remove(((JobWorker) runnable).getInfo());

I have put

    runningWorkers.remove(((JobWorker) runnable));

This was a bug in the Nutch 2.3 code: runningWorkers is a queue of JobWorker objects, so only an object of type JobWorker should be removed from the queue, not jobWorker.getInfo(), because that attempts to remove a JobInfo-typed object from the runningWorkers queue.

> When stopping a crawl in Nutch 2.3, I was having trouble when I start an
> already stopped crawl and then stop it again.
> ------------------------------------------------------------------------
>
>                 Key: NUTCH-2094
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Prerna Satija
>
> I have created a stop button in Nutch webapp to stop a running crawl from the
> UI on click of a "stop" button. While testing, I found that I am able to stop
> a crawl successfully but when I restart a stopped crawl and try to stop it,
> it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
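The bug behind this fix is a quiet no-op: Queue.remove(Object) matches by equals, so passing a JobInfo to a queue that holds JobWorker objects removes nothing and the "running" entry lingers, which is why the restarted crawl could not be stopped. A self-contained demonstration with minimal stand-in classes (JobWorker/JobInfo here are sketches, not Nutch's real classes):

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class QueueRemoveDemo {
    // Minimal stand-ins for Nutch's JobInfo and JobWorker.
    static class JobInfo { }
    static class JobWorker {
        final JobInfo info = new JobInfo();
        JobInfo getInfo() { return info; }
    }

    public static void main(String[] args) {
        Queue<JobWorker> runningWorkers = new ArrayDeque<>();
        JobWorker worker = new JobWorker();
        runningWorkers.add(worker);

        // Buggy call: the queue contains a JobWorker, not a JobInfo, so
        // remove() finds no match and returns false; the worker stays queued.
        boolean buggy = runningWorkers.remove(worker.getInfo());

        // Fixed call: remove the worker itself, which succeeds.
        boolean fixed = runningWorkers.remove(worker);

        System.out.println(buggy + " " + fixed);  // false true
    }
}
```

The stale JobWorker left in the queue is exactly the state that made a restarted crawl appear unstoppable.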
[jira] [Created] (NUTCH-2094) When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
Prerna Satija created NUTCH-2094:
------------------------------------

             Summary: When stopping a crawl in Nutch 2.3, I was having trouble when I start an already stopped crawl and then stop it again.
                 Key: NUTCH-2094
                 URL: https://issues.apache.org/jira/browse/NUTCH-2094
             Project: Nutch
          Issue Type: Bug
            Reporter: Prerna Satija

I have created a stop button in Nutch webapp to stop a running crawl from the UI on click of a "stop" button. While testing, I found that I am able to stop a crawl successfully but when I restart a stopped crawl and try to stop it, it doesn't stop.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper
GitHub user jorgelbg reopened a pull request:

    https://github.com/apache/nutch/pull/55

    WARC exporter for the CommonCrawlDataDumper

    This adds the possibility of exporting the Nutch segments to WARC files.
    From the usage point of view, a couple of new command line options are
    available:

    * `-warc`: enables exporting into WARC files; if not specified, the
      default JACKSON formatter is used.
    * `-warcSize`: defines a maximum file size for each WARC file; if not
      specified, a default of 1 GB per file is used, as recommended by the
      WARC ISO standard.

    The usual `-gzip` flag can be used to enable compression on the WARC
    files.

    Some changes to the default CommonCrawlDataDumper were done, essentially
    to the Factory and to the Formats. These changes avoid creating a new
    instance of a CommonCrawlFormat for each URL read from the segments.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DigitalPebble/nutch warc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/55.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #55

----
commit 0a627e5a5098a2ad4818b594fe567ea7fdd2c131
Author: Jorge Luis Betancourt
Date:   2015-09-08T13:21:04Z

    Initial version of the CommonCrawlWARCFormat; generates valid metadata,
    response and request records. The request records only provide partial
    information, roughly the same as the CommonCrawl Data Dumper at the
    moment.
commit 1889a0b64d48005499f4de01ed18724087feb0f7
Author: Jorge Luis Betancourt
Date:   2015-09-08T16:37:27Z

    Adding the WARCUtils class and the dependency to the ivy.xml file to
    avoid fetching another hadoop dependency

commit 169e5a4a4172424b31c91e232bb69056b10827c7
Author: Jorge Luis Betancourt
Date:   2015-09-08T18:21:47Z

    Removing the transitive property in the ivy.xml file to avoid any future
    trouble

commit ede35d1aa767741ec5206de7990910fc661983e8
Author: Jorge Luis Betancourt
Date:   2015-09-10T17:57:11Z

    Refactoring the existing code, essentially to avoid creating an instance
    of each CommonCrawlFormat per URL processed; since the format is content
    independent at the moment, the factory should allow creating a format
    without this data. Added a close method to the CommonCrawlFormat
    interface for those cases when the format needs some closing statement.

commit 44beb74172364556f70b6f08d0a8ee511c99eff4
Author: Jorge Luis Betancourt
Date:   2015-09-11T14:34:42Z

    Adding the changes to the main CCDataDumper class to call the WARC
    exporter tool. Changes to the Jackson format to work with the new
    structure. Changes to the FormatFactory to create the right Jackson/WARC
    instance.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper
Github user jorgelbg closed the pull request at:

    https://github.com/apache/nutch/pull/55

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[GitHub] nutch pull request: WARC exporter for the CommonCrawlDataDumper
GitHub user jorgelbg opened a pull request:

    https://github.com/apache/nutch/pull/55

    WARC exporter for the CommonCrawlDataDumper

    This adds the possibility of exporting the Nutch segments to WARC files.
    From the usage point of view, a couple of new command line options are
    available:

    * `-warc`: enables exporting into WARC files; if not specified, the
      default JACKSON formatter is used.
    * `-warcSize`: defines a maximum file size for each WARC file; if not
      specified, a default of 1 GB per file is used, as recommended by the
      WARC ISO standard.

    The usual `-gzip` flag can be used to enable compression on the WARC
    files.

    Some changes to the default CommonCrawlDataDumper were done, essentially
    to the Factory and to the Formats. These changes avoid creating a new
    instance of a CommonCrawlFormat for each URL read from the segments.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DigitalPebble/nutch warc

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/nutch/pull/55.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #55

----
commit 0a627e5a5098a2ad4818b594fe567ea7fdd2c131
Author: Jorge Luis Betancourt
Date:   2015-09-08T13:21:04Z

    Initial version of the CommonCrawlWARCFormat; generates valid metadata,
    response and request records. The request records only provide partial
    information, roughly the same as the CommonCrawl Data Dumper at the
    moment.
commit 1889a0b64d48005499f4de01ed18724087feb0f7
Author: Jorge Luis Betancourt
Date:   2015-09-08T16:37:27Z

    Adding the WARCUtils class and the dependency to the ivy.xml file to
    avoid fetching another hadoop dependency

commit 169e5a4a4172424b31c91e232bb69056b10827c7
Author: Jorge Luis Betancourt
Date:   2015-09-08T18:21:47Z

    Removing the transitive property in the ivy.xml file to avoid any future
    trouble

commit ede35d1aa767741ec5206de7990910fc661983e8
Author: Jorge Luis Betancourt
Date:   2015-09-10T17:57:11Z

    Refactoring the existing code, essentially to avoid creating an instance
    of each CommonCrawlFormat per URL processed; since the format is content
    independent at the moment, the factory should allow creating a format
    without this data. Added a close method to the CommonCrawlFormat
    interface for those cases when the format needs some closing statement.

commit 44beb74172364556f70b6f08d0a8ee511c99eff4
Author: Jorge Luis Betancourt
Date:   2015-09-11T14:34:42Z

    Adding the changes to the main CCDataDumper class to call the WARC
    exporter tool. Changes to the Jackson format to work with the new
    structure. Changes to the FormatFactory to create the right Jackson/WARC
    instance.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
[jira] [Created] (NUTCH-2093) Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
Markus Jelsma created NUTCH-2093:
------------------------------------

             Summary: Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
                 Key: NUTCH-2093
                 URL: https://issues.apache.org/jira/browse/NUTCH-2093
             Project: Nutch
          Issue Type: Bug
          Components: indexer
    Affects Versions: 1.10
            Reporter: Markus Jelsma
            Priority: Minor
             Fix For: 1.11
         Attachments: NUTCH-2093.patch

In IndexerMapReduce, a fetchDatum is passed to the indexing filters. However, when this fetchDatum was created via FreeGenerator, it has no signature attached, and the indexing filters don't see one.

This patch copies the signature from the dbDatum just before the fetchDatum is passed to the indexing filters.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2093) Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
[ https://issues.apache.org/jira/browse/NUTCH-2093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2093:
---------------------------------
    Attachment: NUTCH-2093.patch

Patch for trunk.

> Indexing filters have no signature in CrawlDatum if crawled via FreeGenerator
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-2093
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2093
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.10
>            Reporter: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.11
>         Attachments: NUTCH-2093.patch
>
> In IndexerMapReduce, a fetchDatum is passed to the indexing filters. However,
> when this fetchDatum was created via FreeGenerator, it has no signature
> attached, and the indexing filters don't see one.
> This patch copies the signature from the dbDatum just before the fetchDatum
> is passed to the indexing filters.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
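The patch's idea, filling in the missing signature from the dbDatum before the fetchDatum reaches the indexing filters, can be sketched with a minimal stand-in class (the real class is org.apache.nutch.crawl.CrawlDatum; Datum below only models the signature field, and is not Nutch's API):

```java
public class SignatureCopySketch {
    // Hypothetical stand-in for CrawlDatum's signature accessors.
    static class Datum {
        private byte[] signature;
        byte[] getSignature() { return signature; }
        void setSignature(byte[] sig) { signature = sig; }
    }

    public static void main(String[] args) {
        // The dbDatum carries a signature computed during parsing/updating.
        Datum dbDatum = new Datum();
        dbDatum.setSignature(new byte[] {1, 2, 3});

        // A fetchDatum created via FreeGenerator arrives without one.
        Datum fetchDatum = new Datum();

        // The fix: copy the signature across just before the fetchDatum is
        // handed to the indexing filters, so they always see a signature.
        if (fetchDatum.getSignature() == null && dbDatum.getSignature() != null) {
            fetchDatum.setSignature(dbDatum.getSignature());
        }

        System.out.println(fetchDatum.getSignature().length);  // 3
    }
}
```

With the copy in place, the FreeGenerator path behaves like the normal generate/fetch path as far as the indexing filters are concerned.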