Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-23 Thread Doğacan Güney
So, is it OK to remove the pmd-ext directory for now? It is not clear whether
we will need it once we have the infrastructure, but we don't have the
infrastructure now anyway :D. So, I suggest that we remove it for now (which
trims 2.2 MB), add it back after 1.0, and actually use it.

Is everyone OK with this?

On Wed, Jan 21, 2009 at 12:01 AM, Piotr Kosiorowski
pkosiorow...@gmail.com wrote:
 I have configured hudson for 10 or more projects and always used the pmd
 plugin only to display the pmd results; the actual pmd task that
 generates the report was run from the ant script. Maybe it is possible
 to run pmd reports directly in hudson (not through project
 build scripts), but I have never come across it.
 Piotr

 On Tue, Jan 20, 2009 at 10:39 PM, Otis Gospodnetic
 ogjunk-nu...@yahoo.com wrote:
 They've had pmd integrated with Hudson for many months now, I believe.  I've 
 seen patches in JIRA that were the result of fixes for problems reported by 
 pmd.  Or maybe they run pmd by hand?

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 ----- Original Message ----
 From: Doğacan Güney doga...@gmail.com
 To: nutch-dev@lucene.apache.org
 Sent: Tuesday, January 20, 2009 3:40:20 PM
 Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest 
 versions

 On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic
 wrote:
  That I don't know...
 
  I don't see the jars here: 
  http://svn.apache.org/viewvc/hadoop/core/trunk/lib/
 
  But who knows, maybe maven/ivy fetch them on demand.  I don't know.
 

 Hmm, does 0.19 use ivy? (0.19 also doesn't have pmd.)

 http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/

  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  ----- Original Message ----
  From: Doğacan Güney
  To: nutch-dev@lucene.apache.org
  Sent: Tuesday, January 20, 2009 1:13:20 PM
  Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest
 versions
 
  On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic
  wrote:
   Lucene doesn't use anything.
   Hadoop uses pmd integrated in Hudson.
  
 
  Does this mean we do not need the pmd jars in nutch (are they provided by
 hudson)?
 
   Otis
   --
   Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
  
  
  
   ----- Original Message ----
   From: Doğacan Güney
   To: nutch-dev@lucene.apache.org
   Sent: Tuesday, January 20, 2009 10:49:44 AM
   Subject: Re: [jira] Created: (NUTCH-680) Update external jars to 
   latest
  versions
  
   2009/1/20 Piotr Kosiorowski :
pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I
committed them a long time ago in an attempt to bring some static
analysis tools to the nutch sources. There was a short discussion around
it and we all thought it was worth doing, but it never gained enough
momentum. There is a pmd target in the build.xml file that uses them;
they are needed neither at runtime nor for standard builds.
As nutch is built using hudson now, I think it would be worth
integrating pmd (checkstyle/findbugs/cobertura might also be
interesting); hudson has very nice plugins for such tools. I use
it in my daily job and find it valuable.
  
   Thanks for the explanation. I am definitely +1 on having some sort of
   static analysis tools for nutch.
  
   Does anyone know what hadoop/hbase/lucene use for this? or do
   they use something at all?
  
But as I am not an active committer now (I only try to follow the mailing
lists), I do not think it is my call. If everyone is interested, I can
try to look at the integration (but it will move forward slowly: my
youngest kid was born just 2 months ago and takes a lot of attention).
  
   Congratulations!
  
Piotr
   
On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote:
Update external jars to latest versions
---
   
Key: NUTCH-680
URL: 
https://issues.apache.org/jira/browse/NUTCH-680
Project: Nutch
 Issue Type: Improvement
   Reporter: Doğacan Güney
   Assignee: Doğacan Güney
   Priority: Minor
Fix For: 1.0.0
   
   
This issue will be used to update external libraries nutch uses.
   
These are the libraries that are outdated (upon a quick glance):
   
nekohtml (1.9.9)
lucene-highlighter (2.4.0)
jdom (1.1)
carrot2 - as mentioned in another issue
jets3t - above
icu4j (4.0.1)
jakarta-oro (2.0.8)
   
We should probably update tika to whatever the latest is as well before 1.0.

Please add ones I missed in comments.

Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen there.
   
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
   
   
   
  
  
  
   --
   Doğacan 

[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666475#action_12666475
 ] 

Doğacan Güney commented on NUTCH-666:
-

Dennis, is it OK to move this issue out of 1.0? Or do you want to commit it 
before?

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.0.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, 
 russian, and thai.  Also includes a new Language Identifier tool that used 
 the new indexing framework in NUTCH-646.




[jira] Updated: (NUTCH-655) Injecting Crawl metadata

2009-01-23 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-655:


Fix Version/s: 1.1

Moved to 1.1.

 Injecting Crawl metadata
 

 Key: NUTCH-655
 URL: https://issues.apache.org/jira/browse/NUTCH-655
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Reporter: julien nioche
Priority: Minor
 Fix For: 1.1

 Attachments: Injector.patch


 The attached patch allows metadata to be injected into the crawlDB. The input file 
 must contain fields separated by tabs, with the URL in the first 
 column. Metadata names and values are separated by '='. An input line 
 might look like this:
 http://www.myurl.com  \t  categ=value1 \t categ2=value2
 This functionality can be useful for storing external knowledge and indexing it 
 with a custom plugin.
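
 The input format described above could be parsed roughly as follows. This is
 only a minimal sketch and is not taken from Injector.patch: the class name
 InjectLineSketch, the Seed holder and the parse method are hypothetical, and
 the whitespace trimming is an assumption based on the sample line.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class InjectLineSketch {

    /** Hypothetical holder for one parsed seed line: the URL plus its metadata. */
    public static final class Seed {
        public final String url;
        public final Map<String, String> metadata;

        Seed(String url, Map<String, String> metadata) {
            this.url = url;
            this.metadata = metadata;
        }
    }

    /** Splits a line on tabs; the first field is the URL, the rest are name=value pairs. */
    public static Seed parse(String line) {
        String[] fields = line.split("\t");
        String url = fields[0].trim();
        Map<String, String> metadata = new LinkedHashMap<String, String>();
        for (int i = 1; i < fields.length; i++) {
            String field = fields[i].trim();
            int eq = field.indexOf('=');
            if (eq <= 0) {
                continue; // assumption: silently skip fields without a name or without '='
            }
            String name = field.substring(0, eq).trim();
            String value = field.substring(eq + 1).trim();
            metadata.put(name, value);
        }
        return new Seed(url, metadata);
    }

    public static void main(String[] args) {
        Seed seed = parse("http://www.myurl.com  \t  categ=value1 \t categ2=value2");
        System.out.println(seed.url + " " + seed.metadata);
    }
}
```

 A custom indexing plugin could then read the stored names and values back and
 add them as index fields; how the real patch stores them in the crawlDB is not
 shown here.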




[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666477#action_12666477
 ] 

Doğacan Güney commented on NUTCH-628:
-

I don't know much about the patch here. Otis, do you have time to update and 
commit Domain Stats? If not, I will take a look.

 Host database to keep track of host-level information
 -

 Key: NUTCH-628
 URL: https://issues.apache.org/jira/browse/NUTCH-628
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, generator
Reporter: Otis Gospodnetic
 Fix For: 1.1

 Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch


 Nutch would benefit from having a DB with per-host/domain/TLD information.  
 For instance, Nutch could detect hosts that are timing out, store information 
 about that in this DB.  Segment/fetchlist Generator could then skip such 
 hosts, so they don't slow down the fetch job.  Another good use for such a DB 
 is keeping track of various host scores, e.g. spam score.
 From the recent thread on nutch-u...@lucene:
 Otis asked:
  While we are at it, how would one go about implementing this DB, as far as 
  its structures go?
 Andrzej said:
 The easiest I can imagine is to use something like <Text, MapWritable>.
 This way you could store arbitrary information under arbitrary keys.
 I.e. a single database then could keep track of aggregate statistics at
 different levels, e.g. TLD, domain, host, ip range, etc. The basic set
 of statistics could consist of a few predefined gauges, totals and averages.
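
 The key/value idea above can be sketched in plain Java. This is an
 illustration only, not code from either attached patch: java.util maps stand
 in for Hadoop's Text and MapWritable, and the class name HostDbSketch, the
 "host:"/"domain:"/"tld:" key prefixes, the "timeouts" statistic and the skip
 threshold are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class HostDbSketch {

    // key (host, domain or TLD) -> (statistic name -> value);
    // a plain-Java stand-in for a <Text, MapWritable> record
    private final Map<String, Map<String, Long>> records =
            new HashMap<String, Map<String, Long>>();

    /** Adds delta to one statistic stored under an arbitrary key. */
    public void increment(String key, String stat, long delta) {
        Map<String, Long> stats = records.get(key);
        if (stats == null) {
            stats = new TreeMap<String, Long>();
            records.put(key, stats);
        }
        Long old = stats.get(stat);
        stats.put(stat, (old == null ? 0L : old) + delta);
    }

    /** Returns the statistic, or 0 if the key or statistic is unknown. */
    public long get(String key, String stat) {
        Map<String, Long> stats = records.get(key);
        Long value = (stats == null) ? null : stats.get(stat);
        return (value == null) ? 0L : value;
    }

    /** Records one timeout at host, domain and TLD level in the same database. */
    public void recordTimeout(String host, String domain, String tld) {
        increment("host:" + host, "timeouts", 1);
        increment("domain:" + domain, "timeouts", 1);
        increment("tld:" + tld, "timeouts", 1);
    }

    /** A generator could consult this to leave slow hosts out of a fetchlist. */
    public boolean shouldSkip(String host, long maxTimeouts) {
        return get("host:" + host, "timeouts") > maxTimeouts;
    }

    public static void main(String[] args) {
        HostDbSketch db = new HostDbSketch();
        db.recordTimeout("slow.example.com", "example.com", "com");
        db.recordTimeout("slow.example.com", "example.com", "com");
        db.recordTimeout("other.example.com", "example.com", "com");
        System.out.println(db.get("domain:example.com", "timeouts"));
        System.out.println(db.shouldSkip("slow.example.com", 1));
    }
}
```

 Keeping every level in one map mirrors the suggestion that a single database
 can hold aggregate statistics for TLDs, domains and hosts side by side.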




[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-666:
---

Affects Version/s: (was: 1.0.0)
   1.1
Fix Version/s: (was: 1.0.0)
   1.1

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, 
 russian, and thai.  Also includes a new Language Identifier tool that used 
 the new indexing framework in NUTCH-646.




[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666484#action_12666484
 ] 

Dennis Kubes commented on NUTCH-666:


It is ok to move to 1.1.  

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, 
 russian, and thai.  Also includes a new Language Identifier tool that used 
 the new indexing framework in NUTCH-646.




[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2009-01-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666489#action_12666489
 ] 

Doğacan Güney commented on NUTCH-673:
-

It seems that the carrot2 API has indeed changed. I am getting tons of compile errors. 
Could you help me figure out the necessary changes?

 Upgrade the Carrot2 plug-in to release 3.0
 --

 Key: NUTCH-673
 URL: https://issues.apache.org/jira/browse/NUTCH-673
 Project: Nutch
  Issue Type: Improvement
  Components: web gui
Affects Versions: 0.9.0
 Environment: All Nutch deployments.
Reporter: Sean Dean
Priority: Minor
 Fix For: 1.0.0


 Release 3.0 of the Carrot2 plug-in was released recently.
 We currently have version 2.1 in the source tree, and upgrading it to the 
 latest version before the 1.0 release might make sense.
 Details on the release can be found here: 
 http://project.carrot2.org/release-3.0-notes.html
 One major change in requirements is that JDK 1.5 must be used, but this is also 
 now required for Hadoop 0.19, so this wouldn't be the only reason for the 
 switch.




[jira] Commented: (NUTCH-386) Plugin to index categories by url rules

2009-01-23 Thread Stefano Tauriello (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666576#action_12666576
 ] 

Stefano Tauriello commented on NUTCH-386:
-

Can someone help me?
It's very urgent, please.

 Plugin to index categories by url rules
 ---

 Key: NUTCH-386
 URL: https://issues.apache.org/jira/browse/NUTCH-386
 Project: Nutch
  Issue Type: New Feature
  Components: indexer, searcher
Reporter: Ernesto De Santis
Priority: Minor
 Attachments: index-url-category-0.1.zip, index-url-category.jar


 The compressed zip has an install_notes.txt file with instructions.




[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666763#action_12666763
 ] 

Otis Gospodnetic commented on NUTCH-666:


Dennis, could you please describe how this new Lang ID tool is better/different 
from the previous one?

 Analysis plugins for multiple language and new Language Identifier Tool
 ---

 Key: NUTCH-666
 URL: https://issues.apache.org/jira/browse/NUTCH-666
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.1
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.1

 Attachments: NUTCH-666-1-20081126.patch


 Add analysis plugins for czech, greek, japanese, chinese, korean, dutch, 
 russian, and thai.  Also includes a new Language Identifier tool that used 
 the new indexing framework in NUTCH-646.




[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666764#action_12666764
 ] 

Otis Gospodnetic commented on NUTCH-628:


Could you take it if you have time, please?

 Host database to keep track of host-level information
 -

 Key: NUTCH-628
 URL: https://issues.apache.org/jira/browse/NUTCH-628
 Project: Nutch
  Issue Type: New Feature
  Components: fetcher, generator
Reporter: Otis Gospodnetic
 Fix For: 1.1

 Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch


 Nutch would benefit from having a DB with per-host/domain/TLD information.  
 For instance, Nutch could detect hosts that are timing out, store information 
 about that in this DB.  Segment/fetchlist Generator could then skip such 
 hosts, so they don't slow down the fetch job.  Another good use for such a DB 
 is keeping track of various host scores, e.g. spam score.
 From the recent thread on nutch-u...@lucene:
 Otis asked:
  While we are at it, how would one go about implementing this DB, as far as 
  its structures go?
 Andrzej said:
 The easiest I can imagine is to use something like <Text, MapWritable>.
 This way you could store arbitrary information under arbitrary keys.
 I.e. a single database then could keep track of aggregate statistics at
 different levels, e.g. TLD, domain, host, ip range, etc. The basic set
 of statistics could consist of a few predefined gauges, totals and averages.
