[jira] Created: (NUTCH-681) parse-mp3 compilation problem
parse-mp3 compilation problem

Key: NUTCH-681
URL: https://issues.apache.org/jira/browse/NUTCH-681
Project: Nutch
Issue Type: Bug
Components: indexer
Environment: ubuntu, nutch-1.0-dev (trunk revision : 734360)
Reporter: Wildan Maulana
Fix For: 1.0.0

Due to API changes, the MP3 parser (which is not compiled by default due to licensing problem) doesn't compile anymore.

compile:
[echo] Compiling plugin: parse-mp3
[javac] Compiling 2 source files to /home/wildan/jobstuff/LIPI/Ngoprek/nutch/build/parse-mp3/classes
[javac] /home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MP3Parser.java:53: org.apache.nutch.parse.mp3.MP3Parser is not abstract and does not override abstract method getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.Parser
[javac] public class MP3Parser implements Parser {
[javac]^
[javac] /home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MP3Parser.java:58: getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.mp3.MP3Parser cannot implement getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.Parser; attempting to use incompatible return type
[javac] found : org.apache.nutch.parse.Parse
[javac] required: org.apache.nutch.parse.ParseResult
[javac] public Parse getParse(Content content) {
[javac]^
[javac] /home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MetadataCollector.java:54: cannot find symbol
[javac] symbol : constructor Outlink(java.lang.String,java.lang.String,org.apache.hadoop.conf.Configuration)
[javac] location: class org.apache.nutch.parse.Outlink
[javac] links.add(new Outlink(value, , this.conf));
[javac] ^
[javac] Note: /home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MetadataCollector.java uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 3 errors

BUILD FAILED
/home/wildan/jobstuff/LIPI/Ngoprek/nutch/build.xml:113: The following error occurred while executing this line:
/home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/build.xml:55: The following error occurred while executing this line:
/home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/build-plugin.xml:111: Compile failed; see the compiler error output for details.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-681) parse-mp3 compilation problem
[ https://issues.apache.org/jira/browse/NUTCH-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wildan Maulana updated NUTCH-681:
Attachment: MetadataCollector.java-compilation_issues.diff
            MP3Parser.java-compilation_issues.diff

please re-check the patch that I have submitted above
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I committed them a long time ago in an attempt to bring some static analysis tools to the nutch sources. There was a short discussion around it and we all thought it was worth doing, but it never gained enough momentum. There is a pmd target in the build.xml file that uses them - they are not needed at runtime nor for standard builds.

As nutch is built using hudson now, I think it would be worth integrating pmd (checkstyle/findbugs/cobertura might also be interesting) - hudson has very nice plugins for such tools. I am using it in my daily job and I have found it valuable.

But as I am not an active committer now (I only try to follow the mailing lists), I do not think it is my call. If everyone is interested I can try to look at the integration (but it will move forward slowly - my youngest kid was born just 2 months ago and it takes a lot of attention).

Piotr

On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) j...@apache.org wrote:
> Update external jars to latest versions
>
> Key: NUTCH-680
> URL: https://issues.apache.org/jira/browse/NUTCH-680
> Project: Nutch
> Issue Type: Improvement
> Reporter: Doğacan Güney
> Assignee: Doğacan Güney
> Priority: Minor
> Fix For: 1.0.0
>
> This issue will be used to update external libraries nutch uses.
>
> These are the libraries that are outdated (upon a quick glance):
> nekohtml (1.9.9)
> lucene-highlighter (2.4.0)
> jdom (1.1)
> carrot2 - as mentioned in another issue
> jets3t - above
> icu4j (4.0.1)
> jakarta-oro (2.0.8)
>
> We should probably update tika to whatever the latest is as well before 1.0.
> Please add ones I missed in comments.
>
> Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen there.
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
2009/1/20 Piotr Kosiorowski pkosiorow...@gmail.com:
> pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I committed them a long time ago in an attempt to bring some static analysis tools to the nutch sources. There was a short discussion around it and we all thought it was worth doing, but it never gained enough momentum. There is a pmd target in the build.xml file that uses them - they are not needed at runtime nor for standard builds.
>
> As nutch is built using hudson now, I think it would be worth integrating pmd (checkstyle/findbugs/cobertura might also be interesting) - hudson has very nice plugins for such tools. I am using it in my daily job and I have found it valuable.

Thanks for the explanation. I am definitely +1 on having some sort of static analysis tools for nutch. Does anyone know what hadoop/hbase/lucene use for this? Or do they use anything at all?

> But as I am not an active committer now (I only try to follow the mailing lists), I do not think it is my call. If everyone is interested I can try to look at the integration (but it will move forward slowly - my youngest kid was born just 2 months ago and it takes a lot of attention).

Congratulations!

--
Doğacan Güney
[jira] Closed: (NUTCH-572) Scoring and redirected Urls
[ https://issues.apache.org/jira/browse/NUTCH-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-572.
Resolution: Invalid

As Dennis suggested, I am closing this issue as Invalid.

Scoring and redirected Urls

Key: NUTCH-572
URL: https://issues.apache.org/jira/browse/NUTCH-572
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.0.0

When a redirect is found for a given url, the new or end url is stored as the content page and the old CrawlDatum gets one of a few redirect codes. The page that gets indexed in Nutch is the end page and it gets indexed under the end url. Many times a site will have a significant number of links pointing to the start page and very few pointing to the redirected end page. This is especially true for external links. OPIC scores do not get transferred to the end page but stay with the start page (the one doing the redirecting). But the start page doesn't get indexed. Hence the end page will show up in the index, but usually with a much reduced score.

A good example of this is cnn.com:

URL: http://www.cnn.com/
Version: 6
Status: 5 (db_redir_perm)
Fetch time: Tue Dec 04 11:02:09 CST 2007
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 51.19438
Signature: b5baaf80e9e10aa6205fc39051c362ff
Metadata: _pst_:success(1), lastModified=0

which redirects to http://www.cnn.com/?refresh=1

URL: http://www.cnn.com/?refresh=1
Version: 6
Status: 2 (db_fetched)
Fetch time: Tue Dec 04 11:02:11 CST 2007
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: b5baaf80e9e10aa6205fc39051c362ff
Metadata: _pst_:success(1), lastModified=0

Now cnn, which should be one of the highest, if not the highest, ranking sites in the index for keywords such as news, in fact doesn't show up in the index, and its redirected end page appears much farther down in search results.

My proposal is that we somehow make OPIC scores follow redirects. To do this we would most likely need to store a start and end url for redirected urls.
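To make the proposal concrete, here is a minimal sketch of what "scores follow redirects" could mean for the cnn.com entries quoted in the issue. It is only an illustration under stated assumptions: RedirectScoreSketch and followRedirects are hypothetical names, the maps stand in for the CrawlDb, and only single-hop redirects are handled. It is not how Nutch implements OPIC scoring.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of Nutch: moves the score of a redirecting URL to its target. */
class RedirectScoreSketch {

    /**
     * Returns a copy of the scores in which each redirect source's score has
     * been added to its target and the source's own score reset to zero.
     * Only single-hop redirects are handled; the input maps are not modified.
     */
    static Map<String, Float> followRedirects(Map<String, Float> scores,
                                              Map<String, String> redirects) {
        Map<String, Float> result = new HashMap<>(scores);
        for (Map.Entry<String, String> redirect : redirects.entrySet()) {
            String start = redirect.getKey();
            String end = redirect.getValue();
            Float startScore = result.get(start);
            if (startScore == null) {
                continue; // nothing known about the start page, nothing to transfer
            }
            result.merge(end, startScore, Float::sum); // end page inherits the score
            result.put(start, 0.0f);                   // start page is never indexed
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new HashMap<>();
        scores.put("http://www.cnn.com/", 51.19438f);
        scores.put("http://www.cnn.com/?refresh=1", 1.0f);

        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://www.cnn.com/", "http://www.cnn.com/?refresh=1");

        System.out.println(followRedirects(scores, redirects));
    }
}
```

The sketch needs exactly the extra piece of state the issue mentions: a mapping from start url to end url for every redirected url.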
Nutch ScoringFilter plugin problems
Hello,

I want to create a new ScoringFilter plugin. In order to evaluate how interesting a web page is, I need information about the link structure in the LinkDB.

In the method updateDBScore, I have the following lines (among others):

88   linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb"));
...
99   System.out.println("Inlinks to " + url);
100  Inlinks inlinks = linkdb.getInlinks(url);
101  System.out.println("a");
102  Iterator<Inlink> iIt = inlinks.iterator();
103  System.out.println("b");

"a" always gets printed, but "b" rarely gets printed, so it seems that an error happens in line 102 and an exception is raised.

Do you know why this is happening? What am I doing wrong?

Thanks.
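One possible explanation, offered here as an assumption and not confirmed anywhere in this thread: if the lookup on line 100 returns null for a URL that has no entry in the LinkDB, then line 102 would throw a NullPointerException, which would match "a" being printed and "b" usually not. The sketch below uses a plain map as a stand-in to show the null guard. InlinkLookupSketch and its methods are hypothetical and are not the Nutch LinkDbReader API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Stand-in for a link database lookup; NOT the Nutch LinkDbReader API. */
class InlinkLookupSketch {
    private final Map<String, List<String>> linkDb = new HashMap<>();

    /** Records that fromUrl links to url. */
    void addInlink(String url, String fromUrl) {
        linkDb.computeIfAbsent(url, k -> new ArrayList<>()).add(fromUrl);
    }

    /** Returns the inlinks for url, or null when the url has no entry. */
    List<String> getInlinks(String url) {
        return linkDb.get(url);
    }

    /** Counts inlinks, treating a missing entry as zero instead of dereferencing null. */
    int countInlinks(String url) {
        List<String> inlinks = getInlinks(url);
        if (inlinks == null) {
            return 0; // no entry: calling inlinks.iterator() here would throw
        }
        int count = 0;
        for (String ignored : inlinks) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        InlinkLookupSketch db = new InlinkLookupSketch();
        db.addInlink("http://example.org/a", "http://example.org/b");
        System.out.println(db.countInlinks("http://example.org/a"));
        System.out.println(db.countInlinks("http://example.org/unknown"));
    }
}
```

Wrapping lines 100 to 103 in a try/catch that prints the exception would confirm or rule out this guess quickly.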
[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool
[ https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665482#action_12665482 ]

Otis Gospodnetic commented on NUTCH-679:

I'm not sure, but committing this may mess up Todd's work on merging Fetcher and Fetcher2.

Fetcher2 implementing Tool

Key: NUTCH-679
URL: https://issues.apache.org/jira/browse/NUTCH-679
Project: Nutch
Issue Type: Improvement
Components: fetcher
Reporter: julien nioche
Priority: Minor
Attachments: Fetcher2.Tool.patch

The patch attached makes Fetcher2 implement Tool. As a result we should be able to override parameters on the command line e.g.

bin/nutch fetch2 -Dfetcher.server.min.delay=1.0 -Dmapred.reduce.tasks=4 segments/20090115072836

instead of having to modify the *-site.xml files in conf/
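For readers unfamiliar with the mechanism: when a class implements Tool and is launched through Hadoop's ToolRunner, generic options such as -Dkey=value are consumed before the tool sees its own arguments. The sketch below only imitates that split on the command line from the issue. DashDOptionsSketch is a hypothetical, simplified stand-in (it does not handle the separated "-D key=value" form, for example) and is not Hadoop's GenericOptionsParser.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified stand-in for generic option handling; not Hadoop's GenericOptionsParser. */
class DashDOptionsSketch {
    /** Property overrides collected from -Dkey=value arguments. */
    final Map<String, String> overrides = new LinkedHashMap<>();
    /** Everything else, i.e. what the tool itself would receive. */
    final List<String> remaining = new ArrayList<>();

    DashDOptionsSketch(String[] args) {
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (arg.startsWith("-D") && eq > 2) {
                // "-Dname=value": name sits between "-D" and '=', value after '='
                overrides.put(arg.substring(2, eq), arg.substring(eq + 1));
            } else {
                remaining.add(arg);
            }
        }
    }

    public static void main(String[] args) {
        DashDOptionsSketch parsed = new DashDOptionsSketch(new String[] {
            "-Dfetcher.server.min.delay=1.0",
            "-Dmapred.reduce.tasks=4",
            "segments/20090115072836"
        });
        System.out.println("overrides: " + parsed.overrides);
        System.out.println("remaining: " + parsed.remaining);
    }
}
```

The overrides would then be applied on top of the values loaded from the *-site.xml files, which is why the files in conf/ no longer need to be edited for a one-off run.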
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
Lucene doesn't use anything. Hadoop uses pmd integrated in Hudson.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Original Message
From: Doğacan Güney doga...@gmail.com
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 10:49:44 AM
Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

> Thanks for the explanation. I am definitely +1 on having some sort of static analysis tools for nutch. Does anyone know what hadoop/hbase/lucene use for this? or do they use something at all?
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote:
> Lucene doesn't use anything. Hadoop uses pmd integrate in Hudson.

Does this mean we do not need pmd jars in nutch (are they provided by hudson)?

--
Doğacan Güney
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
That I don't know... I don't see the jars here: http://svn.apache.org/viewvc/hadoop/core/trunk/lib/

But who knows, maybe maven/ivy fetch them on demand. I don't know.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

Original Message
From: Doğacan Güney doga...@gmail.com
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 1:13:20 PM
Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

> Does this mean we do not need pmd jars in nutch ( are they provided by hudson)?
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote:
> That I don't know... I don't see the jars here: http://svn.apache.org/viewvc/hadoop/core/trunk/lib/
> But who knows, maybe maven/ivy fetch them on demand. I don't know.

Hmm, does 0.19 use ivy (0.19 also doesn't have pmd)?
http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/

--
Doğacan Güney
[jira] Closed: (NUTCH-661) errors when the uri contains space characters
[ https://issues.apache.org/jira/browse/NUTCH-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney closed NUTCH-661.
Resolution: Won't Fix
Fix Version/s: 1.0.0
Assignee: Doğacan Güney

Closing this issue as Won't Fix. This can be fixed with a urlnormalizer plugin as suggested in comments.

errors when the uri contains space characters

Key: NUTCH-661
URL: https://issues.apache.org/jira/browse/NUTCH-661
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Environment: RedHat 5.1
Reporter: Christos LAIOS
Assignee: Doğacan Güney
Fix For: 1.0.0

While spidering our intranet, I get the following errors when the uri contains space characters:

fetch of http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007 - FINAL.doc failed with: java.lang.IllegalArgumentException: Invalid uri 'http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007 - FINAL.doc': escaped absolute path not valid
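As a rough idea of the normalization such a plugin would perform, the sketch below percent-encodes spaces in the URL from the report. SpaceEscapingSketch and escapeSpaces are hypothetical names used for illustration only. A real fix would live in a urlnormalizer plugin and would probably need to handle other unsafe characters as well.

```java
import java.net.URI;

/** Hypothetical helper illustrating space escaping; not an actual Nutch plugin. */
class SpaceEscapingSketch {

    /** Trims surrounding whitespace and replaces each remaining space with %20. */
    static String escapeSpaces(String url) {
        return url.trim().replace(" ", "%20");
    }

    public static void main(String[] args) {
        String raw = "http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007 - FINAL.doc";
        String fixed = escapeSpaces(raw);
        // URI.create throws IllegalArgumentException for the raw string but accepts the escaped one.
        System.out.println(URI.create(fixed));
    }
}
```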
[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2
[ https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12665562#action_12665562 ]

Doğacan Güney commented on NUTCH-669:

Hi Todd,

Can you upload your work to JIRA now, so that we can review and merge it for 1.0?

Consolidate code for Fetcher and Fetcher2

Key: NUTCH-669
URL: https://issues.apache.org/jira/browse/NUTCH-669
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Fix For: 1.0.0

I'd like to consolidate a lot of the common code between Fetcher and Fetcher2.java. It seems to me like there are the following differences:
- Fetcher relies on the Protocol to obey robots.txt and crawl delay settings whereas Fetcher2 implements them itself
- Fetcher2 uses a different queueing model (queue per crawl host) to accomplish the per-host limiting without making the Protocol do it.

I've begun work on this but want to check with people on the following:
- What reason is there for Fetcher existing at all since Fetcher2 seems to be a superset of functionality?
- Is it on the road map to remove the robots/delay logic from the Http protocol and make Fetcher2's delegation of duties the standard?
- Any other improvements wanted for Fetcher while I am in and around the code?
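The "queue per crawl host" model mentioned above can be pictured with a small sketch. This is only an illustration of the idea, with hypothetical names (PerHostQueuesSketch, add, poll, pending), and it leaves out everything Fetcher2 actually does around it, such as crawl delays, robots.txt handling and threading.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration of one FIFO queue per host; not Fetcher2's implementation. */
class PerHostQueuesSketch {
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();

    /** Appends the url to the queue that belongs to its host. */
    void add(String url) {
        String host = URI.create(url).getHost();
        queues.computeIfAbsent(host, h -> new ArrayDeque<>()).add(url);
    }

    /** Removes and returns the next url for the host, or null if none is pending. */
    String poll(String host) {
        Deque<String> queue = queues.get(host);
        return queue == null ? null : queue.poll();
    }

    /** Number of urls still waiting for the given host. */
    int pending(String host) {
        Deque<String> queue = queues.get(host);
        return queue == null ? 0 : queue.size();
    }

    /** Number of distinct hosts seen so far. */
    int hostCount() {
        return queues.size();
    }

    public static void main(String[] args) {
        PerHostQueuesSketch q = new PerHostQueuesSketch();
        q.add("http://www.cnn.com/");
        q.add("http://www.cnn.com/?refresh=1");
        q.add("http://example.org/index.html");
        System.out.println(q.hostCount() + " hosts, next for cnn: " + q.poll("www.cnn.com"));
    }
}
```

Because each host has its own queue, a scheduler can enforce a per-host delay by deciding when to poll each queue, without asking the protocol layer to block.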
[jira] Updated: (NUTCH-676) MapWritable is written inefficiently and confusingly
[ https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-676:
Attachment: NUTCH-676_v2.patch

Patch for the issue. Bumps the CrawlDatum version and starts using o.a.h.io.MapWritable in CrawlDatum. Compatibility is preserved by keeping nutch's MapWritable around and adding extra code for reading from nutch's MapWritable if the CrawlDatum version is 6.

Also changes CrawlDatum#toString, as hadoop's MapWritable does not have a good toString method.

MapWritable is written inefficiently and confusingly

Key: NUTCH-676
URL: https://issues.apache.org/jira/browse/NUTCH-676
Project: Nutch
Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Priority: Minor
Attachments: 0001-NUTCH-676-Replace-MapWritable-implementation-with-t.patch, NUTCH-676_v2.patch

The MapWritable implementation in o.a.n.crawl is written confusingly - it maintains its own internal linked list which I think may have a bug somewhere (I'm getting an NPE in certain cases in the code, though it's hard to track down).

Can anyone comment as to why MapWritable is written the way it is, rather than just using a HashMap, or a LinkedHashMap if consistent ordering is important? I imagine that would improve performance.

What about just using the Hadoop MapWritable? Obviously that would break some backwards compatibility but it may be a good idea at some point to reduce confusion (I didn't realize that Nutch had its own impl until a few minutes ago).
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
From what I know (from the way we use hudson), hudson only has plugins for presenting tool results. The tools themselves need to be executed during the build, so their libraries need to be included so that they are available to ant.

Piotr

On Tue, Jan 20, 2009 at 9:40 PM, Doğacan Güney doga...@gmail.com wrote:
> Hmm, does 0.19 use ivy(0.19 also doesn't have pmd)?
> http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
They've had pmd integrated with Hudson for many months now, I believe. I've seen patches in JIRA that were the result of fixes for problems reported by pmd. Or maybe they run pmd by hand?

Otis
-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Doğacan Güney doga...@gmail.com
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 3:40:20 PM
Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic wrote:
That I don't know... I don't see the jars here: http://svn.apache.org/viewvc/hadoop/core/trunk/lib/ But who knows, maybe maven/ivy fetch them on demand. I don't know.

Hmm, does 0.19 use ivy (0.19 also doesn't have pmd)? http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/
Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
I have configured Hudson for 10 or more projects and have always used the PMD plugin only to display the PMD results; the actual PMD task that generates the report was run from the Ant script. Maybe it is possible to run PMD reports directly in Hudson (not through the project's build scripts), but I have never come across it.

Piotr

On Tue, Jan 20, 2009 at 10:39 PM, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote:
They've had pmd integrated with Hudson for many months now, I believe. I've seen patches in JIRA that were the result of fixes for problems reported by pmd. Or maybe they run pmd by hand?

Otis
-- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch