Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
So, is it OK to remove the pmd-ext directory for now? It is not clear whether we will need it once we have the infrastructure, but we don't have the infrastructure now anyway :D. So I suggest that we remove it for now (which trims 2.2MB), add it back after 1.0, and actually use it then. Is everyone OK with this?

On Wed, Jan 21, 2009 at 12:01 AM, Piotr Kosiorowski pkosiorow...@gmail.com wrote:

I have configured hudson for 10 or more projects and always used the pmd plugin only to display the pmd results; the actual pmd task that generates the report was run from the ant script. Maybe it is possible to run pmd reports directly in hudson (not through project build scripts), but I have never come across it.

Piotr

On Tue, Jan 20, 2009 at 10:39 PM, Otis Gospodnetic ogjunk-nu...@yahoo.com wrote:

They've had pmd integrated with Hudson for many months now, I believe. I've seen patches in JIRA that were the result of fixes for problems reported by pmd. Or maybe they run pmd by hand?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Doğacan Güney doga...@gmail.com
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 3:40:20 PM
Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic wrote:

That I don't know... I don't see the jars here:
http://svn.apache.org/viewvc/hadoop/core/trunk/lib/
But who knows, maybe maven/ivy fetch them on demand. I don't know.

Hmm, does 0.19 use ivy (0.19 also doesn't have pmd)?
http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Doğacan Güney
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 1:13:20 PM
Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic wrote:

Lucene doesn't use anything. Hadoop uses pmd integrated in Hudson.
Does this mean we do not need pmd jars in nutch (are they provided by hudson)?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Doğacan Güney
To: nutch-dev@lucene.apache.org
Sent: Tuesday, January 20, 2009 10:49:44 AM
Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009/1/20 Piotr Kosiorowski:

pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I committed them a long time ago in an attempt to bring some static analysis tools to the nutch sources. There was a short discussion around it and we all thought it was worth doing, but it never gained enough momentum. There is a pmd target in the build.xml file that uses them; they are not needed at runtime or for standard builds. As nutch is built using hudson now, I think it would be worth integrating pmd (checkstyle/findbugs/cobertura might also be interesting); hudson has very nice plugins for such tools. I am using it in my daily job and I have found it valuable.

Thanks for the explanation. I am definitely +1 on having some sort of static analysis tools for nutch. Does anyone know what hadoop/hbase/lucene use for this? Or do they use anything at all?

But as I am not an active committer now (I only try to follow the mailing lists), I do not think it is my call. If everyone is interested, I can try to look at the integration (but it will move forward slowly: my youngest kid was born just 2 months ago and takes a lot of attention).

Congratulations!

Piotr

On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote:

Update external jars to latest versions
---------------------------------------
Key: NUTCH-680
URL: https://issues.apache.org/jira/browse/NUTCH-680
Project: Nutch
Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
Fix For: 1.0.0

This issue will be used to update external libraries nutch uses.
These are the libraries that are outdated (upon a quick glance):

nekohtml (1.9.9)
lucene-highlighter (2.4.0)
jdom (1.1)
carrot2 - as mentioned in another issue
jets3t - above
icu4j (4.0.1)
jakarta-oro (2.0.8)

We should probably update tika to whatever the latest is as well before 1.0.

Please add ones I missed in comments.

Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen there.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

--
Doğacan
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666475#action_12666475 ]

Doğacan Güney commented on NUTCH-666:

Dennis, is it OK to move this issue out of 1.0? Or do you want to commit it before?

Analysis plugins for multiple language and new Language Identifier Tool
-----------------------------------------------------------------------
Key: NUTCH-666
URL: https://issues.apache.org/jira/browse/NUTCH-666
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.0.0
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.0.0
Attachments: NUTCH-666-1-20081126.patch

Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and Thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646.
[jira] Updated: (NUTCH-655) Injecting Crawl metadata
[ https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-655:

Fix Version/s: 1.1

Moved to 1.1.

Injecting Crawl metadata
------------------------
Key: NUTCH-655
URL: https://issues.apache.org/jira/browse/NUTCH-655
Project: Nutch
Issue Type: Improvement
Components: injector
Reporter: julien nioche
Priority: Minor
Fix For: 1.1
Attachments: Injector.patch

The attached patch allows injecting metadata into the crawlDB. The input file has to contain fields separated by tabs, with the URL in the first column. The metadata names and values are separated by '='. An input line might look like this:

http://www.myurl.com \t categ=value1 \t categ2=value2

This functionality can be useful to store external knowledge and index it with a custom plugin.
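To make the input format described in NUTCH-655 concrete, here is a minimal Python sketch of how one such line could be split into a URL and its metadata. It is not the code from Injector.patch (which is Java and not shown in this thread); the function name `parse_inject_line` is hypothetical, and trimming the spaces around each tab is an assumption based on the sample line.

```python
def parse_inject_line(line):
    """Split one injector input line into (url, metadata dict).

    Hypothetical helper, not part of Nutch. Assumes tab-separated fields,
    URL first, then name=value pairs; whitespace around fields is trimmed.
    """
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    url = fields[0]
    metadata = {}
    for field in fields[1:]:
        if not field:
            continue  # tolerate empty columns, e.g. a trailing tab
        name, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"expected name=value, got {field!r}")
        metadata[name] = value
    return url, metadata


if __name__ == "__main__":
    sample = "http://www.myurl.com \t categ=value1 \t categ2=value2"
    print(parse_inject_line(sample))
```

Splitting on the first '=' only (via `partition`) keeps values that themselves contain '=' intact; whether the actual patch behaves that way is not stated in the issue.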
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666477#action_12666477 ]

Doğacan Güney commented on NUTCH-628:

I don't know much about the patch here. Otis, do you have time to update and commit Domain Stats? If not, I will take a look.

Host database to keep track of host-level information
-----------------------------------------------------
Key: NUTCH-628
URL: https://issues.apache.org/jira/browse/NUTCH-628
Project: Nutch
Issue Type: New Feature
Components: fetcher, generator
Reporter: Otis Gospodnetic
Fix For: 1.1
Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch

Nutch would benefit from having a DB with per-host/domain/TLD information. For instance, Nutch could detect hosts that are timing out, store information about that in this DB. Segment/fetchlist Generator could then skip such hosts, so they don't slow down the fetch job. Another good use for such a DB is keeping track of various host scores, e.g. spam score.

From the recent thread on nutch-u...@lucene:

Otis asked: While we are at it, how would one go about implementing this DB, as far as its structures go?

Andrzej said: The easiest I can imagine is to use something like Text, MapWritable. This way you could store arbitrary information under arbitrary keys. I.e. a single database then could keep track of aggregate statistics at different levels, e.g. TLD, domain, host, ip range, etc. The basic set of statistics could consist of a few predefined gauges, totals and averages.
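The structure Andrzej describes (a text key mapped to a bag of arbitrary statistics, kept at several aggregation levels) can be sketched roughly as follows. This is an illustration only, not the NUTCH-628 patches: plain Python dicts stand in for Hadoop's Text and MapWritable, and every name, key prefix, and threshold here is hypothetical.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def keys_for(url):
    """Return the host-, domain- and TLD-level keys for a URL."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    domain = ".".join(labels[-2:])  # naive: ignores suffixes such as co.uk
    return [f"host:{host}", f"domain:{domain}", f"tld:{labels[-1]}"]


def record_fetch(db, url, timed_out, millis):
    """Update the running totals at every aggregation level."""
    for key in keys_for(url):
        stats = db[key]
        stats["fetches"] = stats.get("fetches", 0) + 1
        stats["timeouts"] = stats.get("timeouts", 0) + (1 if timed_out else 0)
        stats["total_millis"] = stats.get("total_millis", 0) + millis


def should_skip(db, host, max_timeout_ratio=0.5, min_fetches=3):
    """Decide whether a generator could skip a host that keeps timing out."""
    stats = db.get(f"host:{host}")
    if not stats or stats["fetches"] < min_fetches:
        return False  # not enough evidence yet
    return stats["timeouts"] / stats["fetches"] > max_timeout_ratio


if __name__ == "__main__":
    db = defaultdict(dict)  # key -> {statistic name -> value}
    for _ in range(3):
        record_fetch(db, "http://slow.example.com/page", True, 30000)
    record_fetch(db, "http://www.example.com/", False, 120)
    print(should_skip(db, "slow.example.com"), db["domain:example.com"])
```

Storing only totals keeps updates cheap; averages such as mean fetch time can be derived on read as `total_millis / fetches`.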
[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dennis Kubes updated NUTCH-666:

Affects Version/s: 1.1 (was: 1.0.0)
Fix Version/s: 1.1 (was: 1.0.0)

Analysis plugins for multiple language and new Language Identifier Tool
-----------------------------------------------------------------------
Key: NUTCH-666
URL: https://issues.apache.org/jira/browse/NUTCH-666
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.1
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.1
Attachments: NUTCH-666-1-20081126.patch

Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and Thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646.
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666484#action_12666484 ]

Dennis Kubes commented on NUTCH-666:

It is OK to move to 1.1.

Analysis plugins for multiple language and new Language Identifier Tool
-----------------------------------------------------------------------
Key: NUTCH-666
URL: https://issues.apache.org/jira/browse/NUTCH-666
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.1
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.1
Attachments: NUTCH-666-1-20081126.patch

Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and Thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646.
[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0
[ https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666489#action_12666489 ]

Doğacan Güney commented on NUTCH-673:

It seems that the carrot2 API has indeed changed. I am getting tons of compile errors. Could you help me figure out the necessary changes?

Upgrade the Carrot2 plug-in to release 3.0
------------------------------------------
Key: NUTCH-673
URL: https://issues.apache.org/jira/browse/NUTCH-673
Project: Nutch
Issue Type: Improvement
Components: web gui
Affects Versions: 0.9.0
Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor
Fix For: 1.0.0

Release 3.0 of the Carrot2 plug-in was released recently. We currently have version 2.1 in the source tree, and upgrading it to the latest version before the 1.0 release might make sense. Details on the release can be found here:
http://project.carrot2.org/release-3.0-notes.html

One major change in requirements is that JDK 1.5 must be used, but this is also now required for Hadoop 0.19, so this wouldn't be the only reason for the switch.
[jira] Commented: (NUTCH-386) Plugin to index categories by url rules
[ https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666576#action_12666576 ]

Stefano Tauriello commented on NUTCH-386:

Can someone help me? It's very urgent, please.

Plugin to index categories by url rules
---------------------------------------
Key: NUTCH-386
URL: https://issues.apache.org/jira/browse/NUTCH-386
Project: Nutch
Issue Type: New Feature
Components: indexer, searcher
Reporter: Ernesto De Santis
Priority: Minor
Attachments: index-url-category-0.1.zip, index-url-category.jar

The compressed zip has an install_notes.txt file with instructions.
[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool
[ https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666763#action_12666763 ]

Otis Gospodnetic commented on NUTCH-666:

Dennis, could you please describe how this new Lang ID tool is better than or different from the previous one?

Analysis plugins for multiple language and new Language Identifier Tool
-----------------------------------------------------------------------
Key: NUTCH-666
URL: https://issues.apache.org/jira/browse/NUTCH-666
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.1
Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
Fix For: 1.1
Attachments: NUTCH-666-1-20081126.patch

Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, Russian, and Thai. Also includes a new Language Identifier tool that uses the new indexing framework in NUTCH-646.
[jira] Commented: (NUTCH-628) Host database to keep track of host-level information
[ https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666764#action_12666764 ]

Otis Gospodnetic commented on NUTCH-628:

Could you take it if you have time, please?

Host database to keep track of host-level information
-----------------------------------------------------
Key: NUTCH-628
URL: https://issues.apache.org/jira/browse/NUTCH-628
Project: Nutch
Issue Type: New Feature
Components: fetcher, generator
Reporter: Otis Gospodnetic
Fix For: 1.1
Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch

Nutch would benefit from having a DB with per-host/domain/TLD information. For instance, Nutch could detect hosts that are timing out, store information about that in this DB. Segment/fetchlist Generator could then skip such hosts, so they don't slow down the fetch job. Another good use for such a DB is keeping track of various host scores, e.g. spam score.

From the recent thread on nutch-u...@lucene:

Otis asked: While we are at it, how would one go about implementing this DB, as far as its structures go?

Andrzej said: The easiest I can imagine is to use something like Text, MapWritable. This way you could store arbitrary information under arbitrary keys. I.e. a single database then could keep track of aggregate statistics at different levels, e.g. TLD, domain, host, ip range, etc. The basic set of statistics could consist of a few predefined gauges, totals and averages.