[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666764#action_12666764
 ] 

Otis Gospodnetic commented on NUTCH-628:


Could you take it if you have time, please?

> Host database to keep track of host-level information
> -
>
> Key: NUTCH-628
> URL: https://issues.apache.org/jira/browse/NUTCH-628
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, generator
>Reporter: Otis Gospodnetic
> Fix For: 1.1
>
> Attachments: NUTCH-628-DomainStatistics.patch, NUTCH-628-HostDb.patch
>
>
> Nutch would benefit from having a DB with per-host/domain/TLD information.  
> For instance, Nutch could detect hosts that are timing out and store 
> information about that in this DB.  The segment/fetchlist Generator could then 
> skip such hosts, so they don't slow down the fetch job.  Another good use for 
> such a DB is keeping track of various host scores, e.g. spam score.
> From the recent thread on nutch-u...@lucene:
> Otis asked:
> > While we are at it, how would one go about implementing this DB, as far as 
> > its structures go?
> Andrzej said:
> The easiest I can imagine is to use something like .
> This way you could store arbitrary information under arbitrary keys.
> I.e. a single database then could keep track of aggregate statistics at
> different levels, e.g. TLD, domain, host, ip range, etc. The basic set
> of statistics could consist of a few predefined gauges, totals and averages.
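
To make the idea in the issue description concrete, here is a minimal sketch of a store that keeps arbitrary values under arbitrary keys, so that one structure can hold statistics at the TLD, domain, and host levels. It is not taken from either attached patch: the class name HostDb, the helper should_skip, the statistic names "timeouts", "spam_score", and "hosts_seen", and the threshold of 3 are all assumptions made for illustration, and the real thing would live on Hadoop rather than in memory.

```python
from collections import defaultdict


class HostDb:
    """Hypothetical in-memory stand-in for the proposed host DB.

    Each key (a TLD, a domain, or a host name) maps to a dict of
    arbitrary named statistics.
    """

    def __init__(self):
        self._records = defaultdict(dict)

    def put(self, key, name, value):
        # Store or overwrite a gauge-style value such as a score.
        self._records[key][name] = value

    def increment(self, key, name, amount=1):
        # Maintain a running total such as a timeout counter.
        record = self._records[key]
        record[name] = record.get(name, 0) + amount

    def get(self, key, name, default=None):
        # Read without creating an empty record for unknown keys.
        return self._records.get(key, {}).get(name, default)


def should_skip(db, host, max_timeouts=3):
    """Assumed generator-side check: skip hosts that keep timing out."""
    return db.get(host, "timeouts", 0) >= max_timeouts


db = HostDb()
for _ in range(3):
    db.increment("slow.example.com", "timeouts")   # host level
db.put("example.com", "spam_score", 0.1)           # domain level
db.increment("com", "hosts_seen")                  # TLD level
```

With this data, should_skip(db, "slow.example.com") is true, while a host with no recorded timeouts is not skipped.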

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666763#action_12666763
 ] 

Otis Gospodnetic commented on NUTCH-666:


Dennis, could you please describe how this new Lang ID tool is better/different 
from the previous one?

> Analysis plugins for multiple language and new Language Identifier Tool
> ---
>
> Key: NUTCH-666
> URL: https://issues.apache.org/jira/browse/NUTCH-666
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.1
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Fix For: 1.1
>
> Attachments: NUTCH-666-1-20081126.patch
>
>
> Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
> Russian, and Thai.  Also includes a new Language Identifier tool that uses 
> the new indexing framework in NUTCH-646.




[jira] Commented: (NUTCH-386) Plugin to index categories by url rules

2009-01-23 Thread Stefano Tauriello (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666576#action_12666576
 ] 

Stefano Tauriello commented on NUTCH-386:
-

Can someone help me?
It's very urgent, please.

> Plugin to index categories by url rules
> ---
>
> Key: NUTCH-386
> URL: https://issues.apache.org/jira/browse/NUTCH-386
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer, searcher
>Reporter: Ernesto De Santis
>Priority: Minor
> Attachments: index-url-category-0.1.zip, index-url-category.jar
>
>
> The compressed zip has an install_notes.txt file with instructions.




[jira] Commented: (NUTCH-673) Upgrade the Carrot2 plug-in to release 3.0

2009-01-23 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666489#action_12666489
 ] 

Doğacan Güney commented on NUTCH-673:
-

It seems that the Carrot2 API has indeed changed. I am getting tons of compile errors. 
Could you help me figure out the necessary changes?

> Upgrade the Carrot2 plug-in to release 3.0
> --
>
> Key: NUTCH-673
> URL: https://issues.apache.org/jira/browse/NUTCH-673
> Project: Nutch
>  Issue Type: Improvement
>  Components: web gui
>Affects Versions: 0.9.0
> Environment: All Nutch deployments.
>Reporter: Sean Dean
>Priority: Minor
> Fix For: 1.0.0
>
>
> Release 3.0 of the Carrot2 plug-in was released recently.
> We currently have version 2.1 in the source tree, and upgrading it to the 
> latest version before the 1.0 release might make sense.
> Details on the release can be found here: 
> http://project.carrot2.org/release-3.0-notes.html
> One major change in requirements is that JDK 1.5 must be used, but this is also 
> now required for Hadoop 0.19, so this wouldn't be the only reason for the 
> switch.




[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Dennis Kubes (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666484#action_12666484
 ] 

Dennis Kubes commented on NUTCH-666:


It is OK to move this to 1.1.




[jira] Updated: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Dennis Kubes (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dennis Kubes updated NUTCH-666:
---

Affects Version/s: 1.1 (was: 1.0.0)
Fix Version/s: 1.1 (was: 1.0.0)




[jira] Commented: (NUTCH-628) Host database to keep track of host-level information

2009-01-23 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-628?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666477#action_12666477
 ] 

Doğacan Güney commented on NUTCH-628:
-

I don't know much about the patch here. Otis, do you have time to update and 
commit Domain Stats? If not, I will take a look.




[jira] Updated: (NUTCH-655) Injecting Crawl metadata

2009-01-23 Thread Doğacan Güney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-655:


Fix Version/s: 1.1

Moved to 1.1.

> Injecting Crawl metadata
> 
>
> Key: NUTCH-655
> URL: https://issues.apache.org/jira/browse/NUTCH-655
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Reporter: julien nioche
>Priority: Minor
> Fix For: 1.1
>
> Attachments: Injector.patch
>
>
> The attached patch allows injecting metadata into the crawlDB. The input file 
> has to contain fields separated by tabs, with the URL in the first 
> column. The metadata names and values are separated by '='. An input line 
> might look like this:
> http://www.myurl.com  \t  categ=value1 \t categ2=value2
> This functionality can be useful for storing external knowledge and indexing it 
> with a custom plugin.
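
The input format described above can be sketched as a small parser. This is only an illustration of the line layout, not the code in Injector.patch: the function name parse_seed_line is made up, and trimming the spaces around each tab-separated field is an assumption based on the example line in the description.

```python
def parse_seed_line(line):
    """Split one injector input line into (url, metadata dict).

    Assumed layout: URL in the first tab-separated column, followed by
    zero or more name=value fields.
    """
    fields = [field.strip() for field in line.rstrip("\n").split("\t")]
    url = fields[0]
    metadata = {}
    for field in fields[1:]:
        if not field:
            continue  # tolerate empty columns, e.g. a trailing tab
        name, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"metadata field without '=': {field!r}")
        metadata[name.strip()] = value.strip()
    return url, metadata


url, metadata = parse_seed_line(
    "http://www.myurl.com  \t  categ=value1 \t categ2=value2\n"
)
```

Here url is the bare URL and metadata maps categ to value1 and categ2 to value2; a field without '=' is rejected rather than silently dropped, which is a design choice of this sketch.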




[jira] Commented: (NUTCH-666) Analysis plugins for multiple language and new Language Identifier Tool

2009-01-23 Thread Doğacan Güney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12666475#action_12666475
 ] 

Doğacan Güney commented on NUTCH-666:
-

Dennis, is it OK to move this issue out of 1.0? Or do you want to commit it 
before?

> Analysis plugins for multiple language and new Language Identifier Tool
> ---
>
> Key: NUTCH-666
> URL: https://issues.apache.org/jira/browse/NUTCH-666
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.0.0
> Environment: All
>Reporter: Dennis Kubes
>Assignee: Dennis Kubes
> Fix For: 1.0.0
>
> Attachments: NUTCH-666-1-20081126.patch
>
>
> Add analysis plugins for Czech, Greek, Japanese, Chinese, Korean, Dutch, 
> Russian, and Thai.  Also includes a new Language Identifier tool that uses 
> the new indexing framework in NUTCH-646.

You can reply to this email to add a comment to the issue online.



Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-23 Thread Doğacan Güney
So, is it OK to remove the pmd-ext directory for now? It is not clear if we
need it when we have the infrastructure, but we don't have the infrastructure
now anyway :D. So, I suggest that we remove it for now (and trim 2.2MB), and
add it back after 1.0 and actually use it.

Is everyone OK with this?

On Wed, Jan 21, 2009 at 12:01 AM, Piotr Kosiorowski
 wrote:
> I have configured hudson for 10 or more projects and always used the pmd
> plugin to display the pmd results only - the actual pmd task to
> generate the report was run from an ant script. Maybe there is a
> possibility to run pmd reports directly in hudson (not through project
> build scripts) but I have never come across it.
> Piotr
>
> On Tue, Jan 20, 2009 at 10:39 PM, Otis Gospodnetic
>  wrote:
>> They've had pmd integrated with Hudson for many months now, I believe.  I've 
>> seen patches in JIRA that were the result of fixes for problems reported by 
>> pmd.  Or maybe they run pmd by hand?
>>
>> Otis
>> --
>> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>
>>
>>
>> - Original Message 
>>> From: Doğacan Güney 
>>> To: nutch-dev@lucene.apache.org
>>> Sent: Tuesday, January 20, 2009 3:40:20 PM
>>> Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest 
>>> versions
>>>
>>> On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic
>>> wrote:
>>> > That I don't know...
>>> >
>>> > I don't see the jars here: 
>>> > http://svn.apache.org/viewvc/hadoop/core/trunk/lib/
>>> >
>>> > But who knows, maybe maven/ivy fetch them on demand.  I don't know.
>>> >
>>>
>>> Hmm, does 0.19 use ivy (0.19 also doesn't have pmd)?
>>>
>>> http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/
>>>
>>> > Otis
>>> > --
>>> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>> >
>>> >
>>> >
>>> > - Original Message 
>>> >> From: Doğacan Güney
>>> >> To: nutch-dev@lucene.apache.org
>>> >> Sent: Tuesday, January 20, 2009 1:13:20 PM
>>> >> Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
>>> >>
>>> >> On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic
>>> >> wrote:
>>> >> > Lucene doesn't use anything.
>>> >> > Hadoop uses pmd integrated in Hudson.
>>> >> >
>>> >>
>>> >> Does this mean we do not need pmd jars in nutch (are they provided by hudson)?
>>> >>
>>> >> > Otis
>>> >> > --
>>> >> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>>> >> >
>>> >> >
>>> >> >
>>> >> > - Original Message 
>>> >> >> From: Doğacan Güney
>>> >> >> To: nutch-dev@lucene.apache.org
>>> >> >> Sent: Tuesday, January 20, 2009 10:49:44 AM
>>> >> >> Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
>>> >> >>
>>> >> >> 2009/1/20 Piotr Kosiorowski :
>>> >> >> > pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I
>>> >> >> > committed them a long time ago in an attempt to bring some static
>>> >> >> > analysis tools to the nutch sources. There was a short discussion
>>> >> >> > around it and we all thought it was worth doing, but it never gained
>>> >> >> > enough momentum. There is a pmd target in the build.xml file that uses
>>> >> >> > them - they are not needed at runtime nor for standard builds.
>>> >> >> > As nutch is built using hudson now, I think it would be worth
>>> >> >> > integrating pmd (checkstyle/findbugs/cobertura might also be
>>> >> >> > interesting) - hudson has very nice plugins for such tools. I am
>>> >> >> > using it in my daily job and I have found it valuable.
>>> >> >>
>>> >> >> Thanks for the explanation. I am definitely +1 on having some sort of
>>> >> >> static analysis tools for nutch.
>>> >> >>
>>> >> >> Does anyone know what hadoop/hbase/lucene use for this? Or do
>>> >> >> they use anything at all?
>>> >> >>
>>> >> >> > But as I am not an active committer now (I only try to follow the
>>> >> >> > mailing lists) I do not think it is my call.  But if everyone is
>>> >> >> > interested I can try to look at the integration (but it will move
>>> >> >> > forward slowly - my youngest kid was born just 2 months ago and that
>>> >> >> > takes a lot of attention).
>>> >> >>
>>> >> >> Congratulations!
>>> >> >>
>>> >> >> > Piotr
>>> >> >> >
>>> >> >> > On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote:
>>> >> >> >> Update external jars to latest versions
>>> >> >> >> ---
>>> >> >> >>
>>> >> >> >> Key: NUTCH-680
>>> >> >> >> URL: 
>>> >> >> >> https://issues.apache.org/jira/browse/NUTCH-680
>>> >> >> >> Project: Nutch
>>> >> >> >>  Issue Type: Improvement
>>> >> >> >>Reporter: Doğacan Güney
>>> >> >> >>Assignee: Doğacan Güney
>>> >> >> >>Priority: Minor
>>> >> >> >> Fix For: 1.0.0
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> This issue will be used to update external libraries nutch uses.
>>> >> >> >>
>>> >> >> >> These are the libraries tha