[jira] Created: (NUTCH-681) parse-mp3 compilation problem

2009-01-20 Thread Wildan Maulana (JIRA)
parse-mp3 compilation problem
-

 Key: NUTCH-681
 URL: https://issues.apache.org/jira/browse/NUTCH-681
 Project: Nutch
  Issue Type: Bug
  Components: indexer
 Environment: ubuntu, nutch-1.0-dev (trunk revision : 734360)

Reporter: Wildan Maulana
 Fix For: 1.0.0


Due to API changes, the MP3 parser (which is not compiled by default due to a 
licensing problem) no longer compiles.

compile:
 [echo] Compiling plugin: parse-mp3
[javac] Compiling 2 source files to 
/home/wildan/jobstuff/LIPI/Ngoprek/nutch/build/parse-mp3/classes
[javac] 
/home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MP3Parser.java:53:
 org.apache.nutch.parse.mp3.MP3Parser is not abstract and does not override 
abstract method getParse(org.apache.nutch.protocol.Content) in 
org.apache.nutch.parse.Parser
[javac] public class MP3Parser implements Parser {
[javac]^
[javac] 
/home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MP3Parser.java:58:
 getParse(org.apache.nutch.protocol.Content) in 
org.apache.nutch.parse.mp3.MP3Parser cannot implement 
getParse(org.apache.nutch.protocol.Content) in org.apache.nutch.parse.Parser; 
attempting to use incompatible return type
[javac] found   : org.apache.nutch.parse.Parse
[javac] required: org.apache.nutch.parse.ParseResult
[javac]   public Parse getParse(Content content) {
[javac]^
[javac] 
/home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MetadataCollector.java:54:
 cannot find symbol
[javac] symbol  : constructor 
Outlink(java.lang.String,java.lang.String,org.apache.hadoop.conf.Configuration)
[javac] location: class org.apache.nutch.parse.Outlink
[javac]   links.add(new Outlink(value, "", this.conf));
[javac] ^
[javac] Note: 
/home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/parse-mp3/src/java/org/apache/nutch/parse/mp3/MetadataCollector.java
 uses unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 3 errors

BUILD FAILED
/home/wildan/jobstuff/LIPI/Ngoprek/nutch/build.xml:113: The following error 
occurred while executing this line:
/home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/build.xml:55: The following 
error occurred while executing this line:
/home/wildan/jobstuff/LIPI/Ngoprek/nutch/src/plugin/build-plugin.xml:111: 
Compile failed; see the compiler error output for details.
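
The three errors come down to two API changes: getParse(Content) must now return a ParseResult instead of a Parse, and the three-argument Outlink constructor (with a Configuration) no longer exists. A minimal sketch of the adaptation follows. All types in it are small stand-ins written for this example, not the Nutch classes; the two-argument Outlink constructor, the put method and the simplified getParse signature are assumptions, so check the trunk sources for the real signatures.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class Mp3ParserSketch {
    // Stand-in types for illustration only; these are NOT the real Nutch classes.
    static final class Parse {
        final String text;
        Parse(String text) { this.text = text; }
    }

    // Assumed shape: a ParseResult maps each URL to its Parse.
    static final class ParseResult {
        private final Map<String, Parse> parses = new LinkedHashMap<>();
        void put(String url, Parse parse) { parses.put(url, parse); }
        Parse get(String url) { return parses.get(url); }
        int size() { return parses.size(); }
    }

    // Assumed shape: two String arguments, no Configuration.
    static final class Outlink {
        final String toUrl;
        final String anchor;
        Outlink(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // Hypothetical, simplified signature; the real method takes a Content object.
    interface Parser {
        ParseResult getParse(String url, String content);
    }

    static final class Mp3Parser implements Parser {
        // Before: returned a single Parse. After: wraps it in a ParseResult.
        @Override
        public ParseResult getParse(String url, String content) {
            ParseResult result = new ParseResult();
            result.put(url, new Parse(content.trim()));
            return result;
        }
    }

    // Before: three arguments including a Configuration. After: URL and anchor only.
    static Outlink link(String value) {
        return new Outlink(value, "");
    }

    public static void main(String[] args) {
        String url = "http://example.com/a.mp3";
        ParseResult result = new Mp3Parser().getParse(url, " Some Title ");
        System.out.println(result.get(url).text);
        System.out.println(link("http://example.com/").toUrl);
    }
}
```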


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (NUTCH-681) parse-mp3 compilation problem

2009-01-20 Thread Wildan Maulana (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wildan Maulana updated NUTCH-681:
-

Attachment: MetadataCollector.java-compilation_issues.diff
MP3Parser.java-compilation_issues.diff

Please re-check the patches that I have attached above.

 parse-mp3 compilation problem
 -

 Key: NUTCH-681
 URL: https://issues.apache.org/jira/browse/NUTCH-681
 Project: Nutch
  Issue Type: Bug
  Components: indexer
 Environment: ubuntu, nutch-1.0-dev (trunk revision : 734360)
Reporter: Wildan Maulana
 Fix For: 1.0.0

 Attachments: MetadataCollector.java-compilation_issues.diff, 
 MP3Parser.java-compilation_issues.diff






Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Piotr Kosiorowski
pmd-ext contains the PMD (http://pmd.sourceforge.net/) libraries. I
committed them a long time ago in an attempt to bring some static
analysis tools to the nutch sources. There was a short discussion around
it and we all thought it was worth doing, but it never gained enough
momentum. There is a pmd target in the build.xml file that uses them;
they are not needed at runtime nor for standard builds.
As nutch is built using hudson now, I think it would be worth
integrating pmd (checkstyle/findbugs/cobertura might also be
interesting) - hudson has very nice plugins for such tools. I am using
it in my daily job and I have found it valuable.
But as I am not an active committer now (I only try to follow the mailing
lists), I do not think it is my call. If everyone is
interested, I can try to look at the integration (but it will move forward
slowly - my youngest kid was born just 2 months ago and takes a lot
of attention).
Piotr

On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) j...@apache.org wrote:
 Update external jars to latest versions
 ---

 Key: NUTCH-680
 URL: https://issues.apache.org/jira/browse/NUTCH-680
 Project: Nutch
  Issue Type: Improvement
Reporter: Doğacan Güney
Assignee: Doğacan Güney
Priority: Minor
 Fix For: 1.0.0


 This issue will be used to update external libraries nutch uses.

 These are the libraries that are outdated (upon a quick glance):

 nekohtml (1.9.9)
 lucene-highlighter (2.4.0)
 jdom (1.1)
 carrot2 - as mentioned in another issue
 jets3t - above
 icu4j (4.0.1)
 jakarta-oro (2.0.8)

 We should probably update tika to whatever the latest is as well before 1.0.


 Please add ones  I missed in comments.

 Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen 
 there.

 --
 This message is automatically generated by JIRA.
 -
 You can reply to this email to add a comment to the issue online.




Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Doğacan Güney
2009/1/20 Piotr Kosiorowski pkosiorow...@gmail.com:
 pmd-ext contains the PMD (http://pmd.sourceforge.net/) libraries. I
 committed them a long time ago in an attempt to bring some static
 analysis tools to the nutch sources. There was a short discussion around
 it and we all thought it was worth doing, but it never gained enough
 momentum. There is a pmd target in the build.xml file that uses them;
 they are not needed at runtime nor for standard builds.
 As nutch is built using hudson now, I think it would be worth
 integrating pmd (checkstyle/findbugs/cobertura might also be
 interesting) - hudson has very nice plugins for such tools. I am using
 it in my daily job and I have found it valuable.

Thanks for the explanation. I am definitely +1 on having some sort of
static analysis tools for nutch.

Does anyone know what hadoop/hbase/lucene use for this? Or do
they use anything at all?

 But as I am not an active committer now (I only try to follow the mailing
 lists), I do not think it is my call. If everyone is
 interested, I can try to look at the integration (but it will move forward
 slowly - my youngest kid was born just 2 months ago and takes a lot
 of attention).

Congratulations!

 Piotr

-- 
Doğacan Güney


[jira] Closed: (NUTCH-572) Scoring and redirected Urls

2009-01-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-572?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney closed NUTCH-572.
---

Resolution: Invalid

As Dennis suggested, I am closing this issue as Invalid.

 Scoring and redirected Urls
 ---

 Key: NUTCH-572
 URL: https://issues.apache.org/jira/browse/NUTCH-572
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
 Environment: All
Reporter: Dennis Kubes
Assignee: Dennis Kubes
 Fix For: 1.0.0


 When a redirect is found for a given url, the new or end url is stored as the 
 content page and the old CrawlDatum gets one of a few redirect codes.  The 
 page that gets indexed in Nutch is the end page and it gets indexed under the 
 end url.  Many times a site will have a significant number of links pointing 
 to the start page and very few pointing to the redirected end page.  This is 
 especially true for external links.  OPIC scores do not get transferred to the 
 end page but stay with the start page (the one doing the redirecting).  But 
 the start page doesn't get indexed.  Hence the end page will show up in the 
 index but with a usually much reduced score.  A good example of this is 
 cnn.com:
 URL: http://www.cnn.com/
 Version: 6
 Status: 5 (db_redir_perm)
 Fetch time: Tue Dec 04 11:02:09 CST 2007
 Modified time: Wed Dec 31 18:00:00 CST 1969
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)
 Score: 51.19438
 Signature: b5baaf80e9e10aa6205fc39051c362ff
 Metadata: _pst_:success(1), lastModified=0
 which redirects to http://www.cnn.com/?refresh=1
 URL: http://www.cnn.com/?refresh=1
 Version: 6
 Status: 2 (db_fetched)
 Fetch time: Tue Dec 04 11:02:11 CST 2007
 Modified time: Wed Dec 31 18:00:00 CST 1969
 Retries since fetch: 0
 Retry interval: 2592000 seconds (30 days)
 Score: 1.0
 Signature: b5baaf80e9e10aa6205fc39051c362ff
 Metadata: _pst_:success(1), lastModified=0
 Now cnn, which should be one of the highest, if not the highest, ranking sites 
 in the index for keywords such as news, in fact doesn't show up in the index 
 and its redirected end page appears much farther down in search results.  My 
 proposal is we somehow make OPIC scores follow redirects.  To do this we 
 would most likely need to store a start and end url for redirected urls.
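
The proposal can be sketched as follows: given a record of start and end urls, the score accumulated by the start url is credited to the end url. This only illustrates the idea and is not how Nutch's scoring filters work; the class and method names are hypothetical, and the scores are the ones from the cnn.com example above.

```java
import java.util.HashMap;
import java.util.Map;

final class RedirectScoreSketch {
    // Hypothetical helper: for each (start -> end) redirect, move the start
    // url's score onto the end url and drop the start url from the result.
    static Map<String, Float> foldRedirects(Map<String, Float> scores,
                                            Map<String, String> redirects) {
        Map<String, Float> result = new HashMap<>(scores);
        for (Map.Entry<String, String> redirect : redirects.entrySet()) {
            Float startScore = result.remove(redirect.getKey());
            if (startScore == null) {
                continue; // nothing recorded for the start url
            }
            // Add to the end url's score, or create the entry if it is missing.
            result.merge(redirect.getValue(), startScore, Float::sum);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Float> scores = new HashMap<>();
        scores.put("http://www.cnn.com/", 51.19438f);
        scores.put("http://www.cnn.com/?refresh=1", 1.0f);

        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://www.cnn.com/", "http://www.cnn.com/?refresh=1");

        System.out.println(foldRedirects(scores, redirects));
    }
}
```

The sketch handles single-hop redirects only; chains of redirects would need to be resolved to their final url first.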




Nutch ScoringFilter plugin problems

2009-01-20 Thread Pau
Hello,
I want to create a new ScoringFilter plugin. In order to evaluate how
interesting a web page is, I need information about the link structure in
the LinkDB.
In the method updateDBScore, I have the following lines (among others):

    88    linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb"));
    ...
    99    System.out.println("Inlinks to " + url);
   100    Inlinks inlinks = linkdb.getInlinks(url);
   101    System.out.println("a");
   102    Iterator<Inlink> iIt = inlinks.iterator();
   103    System.out.println("b");

"a" always gets printed, but "b" rarely gets printed, so it seems that an
error happens in line 102 and an exception is raised. Do you know why this
is happening? What am I doing wrong? Thanks.
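
One possible cause, offered as an assumption rather than a diagnosis: if getInlinks returns null for a url that has no entry in the LinkDB, the call to inlinks.iterator() on line 102 throws a NullPointerException, which would explain why "b" is printed only for some urls. The sketch below uses a plain map as a stand-in for the LinkDB (the lookup and countInlinks names are hypothetical) and shows the guard.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

final class InlinkGuardSketch {
    // Stand-in for the LinkDB: url -> list of inlinking urls.
    private final Map<String, List<String>> linkDb = new HashMap<>();

    void add(String url, String... fromUrls) {
        linkDb.put(url, Arrays.asList(fromUrls));
    }

    // Assumption: like this map, the reader returns null when a url has no entry.
    List<String> lookup(String url) {
        return linkDb.get(url);
    }

    int countInlinks(String url) {
        List<String> inlinks = lookup(url);
        if (inlinks == null) {
            return 0; // no inlinks recorded; calling iterator() here would throw
        }
        int count = 0;
        Iterator<String> it = inlinks.iterator();
        while (it.hasNext()) {
            it.next();
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        InlinkGuardSketch db = new InlinkGuardSketch();
        db.add("http://example.com/", "http://a.example/", "http://b.example/");
        System.out.println(db.countInlinks("http://example.com/"));
        System.out.println(db.countInlinks("http://unknown.example/"));
    }
}
```

Wrapping lines 100-103 in a try/catch that logs the exception would confirm or rule out this explanation.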


[jira] Commented: (NUTCH-679) Fetcher2 implementing Tool

2009-01-20 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12665482#action_12665482
 ] 

Otis Gospodnetic commented on NUTCH-679:


I'm not sure, but committing this may mess up Todd's work on merging Fetcher 
and Fetcher2.


 Fetcher2 implementing Tool
 --

 Key: NUTCH-679
 URL: https://issues.apache.org/jira/browse/NUTCH-679
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Reporter: julien nioche
Priority: Minor
 Attachments: Fetcher2.Tool.patch


 The patch attached makes Fetcher2 implement Tool. As a result we should be 
 able to override parameters on the command line e.g. 
 bin/nutch fetch2 -Dfetcher.server.min.delay=1.0 -Dmapred.reduce.tasks=4 
 segments/20090115072836
 instead of having to modify the *-site.xml files in conf/
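
For readers unfamiliar with the mechanism: a Hadoop Tool is normally launched through ToolRunner, which strips generic options such as -Dkey=value from the argument list and applies them to the configuration before the tool sees the remaining arguments. The sketch below imitates that split with plain Java collections. It is an illustration, not Hadoop's implementation; GenericArgsSketch is a hypothetical name, and only the joined -Dkey=value form used in the command above is handled.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class GenericArgsSketch {
    // Configuration overrides collected from -Dkey=value arguments.
    final Map<String, String> overrides = new LinkedHashMap<>();
    // Everything else, passed on to the tool unchanged.
    final List<String> remaining = new ArrayList<>();

    static GenericArgsSketch parse(String[] args) {
        GenericArgsSketch parsed = new GenericArgsSketch();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            // Require at least one character of key between "-D" and "=".
            if (arg.startsWith("-D") && eq > 2) {
                parsed.overrides.put(arg.substring(2, eq), arg.substring(eq + 1));
            } else {
                parsed.remaining.add(arg);
            }
        }
        return parsed;
    }

    public static void main(String[] args) {
        // The arguments that follow "bin/nutch fetch2" in the example command.
        GenericArgsSketch parsed = parse(new String[] {
            "-Dfetcher.server.min.delay=1.0",
            "-Dmapred.reduce.tasks=4",
            "segments/20090115072836"
        });
        System.out.println(parsed.overrides);
        System.out.println(parsed.remaining);
    }
}
```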




Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Otis Gospodnetic
Lucene doesn't use anything.
Hadoop uses PMD integrated in Hudson.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch




Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Doğacan Güney
On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic
ogjunk-nu...@yahoo.com wrote:
 Lucene doesn't use anything.
 Hadoop uses PMD integrated in Hudson.


Does this mean we do not need the pmd jars in nutch (are they provided by hudson)?

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



-- 
Doğacan Güney


Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Otis Gospodnetic
That I don't know...

I don't see the jars here: http://svn.apache.org/viewvc/hadoop/core/trunk/lib/

But who knows, maybe maven/ivy fetch them on demand.  I don't know.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch





Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Doğacan Güney
On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic
ogjunk-nu...@yahoo.com wrote:
 That I don't know...

 I don't see the jars here: http://svn.apache.org/viewvc/hadoop/core/trunk/lib/

 But who knows, maybe maven/ivy fetch them on demand.  I don't know.


Hmm, does 0.19 use ivy? (0.19 also doesn't have pmd.)

http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



-- 
Doğacan Güney


[jira] Closed: (NUTCH-661) errors when the uri contains space characters

2009-01-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney closed NUTCH-661.
---

   Resolution: Won't Fix
Fix Version/s: 1.0.0
 Assignee: Doğacan Güney

Closing this issue as Won't Fix.

This can be fixed with a urlnormalizer plugin as suggested in comments.

 errors when the uri contains space characters 
 --

 Key: NUTCH-661
 URL: https://issues.apache.org/jira/browse/NUTCH-661
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
 Environment: RedHat 5.1
Reporter: Christos LAIOS
Assignee: Doğacan Güney
 Fix For: 1.0.0


 While spidering our intranet, I get the following errors when the uri 
 contains space characters:
 fetch of http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007 - 
 FINAL.doc failed with: java.lang.IllegalArgumentException: Invalid uri 
 'http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007 - FINAL.doc': 
 escaped absolute path not valid
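
A url normalizer for this case could percent-encode the spaces before the fetcher sees the url. The sketch below shows only the string transformation; the class name is hypothetical and it does not implement Nutch's url normalizer plugin interface.

```java
final class SpaceEscapingSketch {
    // Trims surrounding whitespace and percent-encodes the remaining spaces.
    // Other illegal characters are left untouched in this sketch.
    static String escapeSpaces(String url) {
        return url.trim().replace(" ", "%20");
    }

    public static void main(String[] args) {
        String raw = "http://intranet-rtd.rtd.cec.eu.int/services/docs/AAR_2007 - FINAL.doc";
        System.out.println(escapeSpaces(raw));
    }
}
```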




[jira] Commented: (NUTCH-669) Consolidate code for Fetcher and Fetcher2

2009-01-20 Thread JIRA

[ 
https://issues.apache.org/jira/browse/NUTCH-669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12665562#action_12665562
 ] 

Doğacan Güney commented on NUTCH-669:
-

Hi Todd,

Can you upload your work to JIRA now, so that we can review and merge it for 
1.0?

 Consolidate code for Fetcher and Fetcher2
 -

 Key: NUTCH-669
 URL: https://issues.apache.org/jira/browse/NUTCH-669
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.9.0
Reporter: Todd Lipcon
 Fix For: 1.0.0


 I'd like to consolidate a lot of the common code between Fetcher and 
 Fetcher2.java.
 It seems to me like there are the following differences:
   - Fetcher relies on the Protocol to obey robots.txt and crawl delay 
 settings whereas Fetcher2 implements them itself
   - Fetcher2 uses a different queueing model (queue per crawl host) to 
 accomplish the per-host limiting without making the Protocol do it.
 I've begun work on this but want to check with people on the following:
 - What reason is there for Fetcher existing at all since Fetcher2 seems to be 
 a superset of functionality?
 - Is it on the road map to remove the robots/delay logic from the Http 
 protocol and make Fetcher2's delegation of duties the standard?
 - Any other improvements wanted for Fetcher while I am in and around the code?




[jira] Updated: (NUTCH-676) MapWritable is written inefficiently and confusingly

2009-01-20 Thread JIRA

 [ 
https://issues.apache.org/jira/browse/NUTCH-676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-676:


Attachment: NUTCH-676_v2.patch

Patch for the issue.

Bumps the CrawlDatum version and starts using o.a.h.io.MapWritable in 
CrawlDatum. Compatibility is preserved by keeping nutch's MapWritable around 
and adding extra code for reading from the nutch MapWritable if the CrawlDatum 
version is 6.

Also changes CrawlDatum#toString as hadoop's MapWritable does not have a good 
toString method.
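
The compatibility approach described here, reading the old layout when the stored version is 6 and the new layout otherwise, can be sketched with plain java.io streams. The two record layouts, the version value 7 and all names below are assumptions made for the illustration; the patch itself defines the real formats.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

final class VersionedMetaSketch {
    static final byte OLD_VERSION = 6; // version named in the comment above
    static final byte CUR_VERSION = 7; // assumed value after the bump

    // Assumed new layout: version, count, then key and value as separate strings.
    static byte[] write(Map<String, String> meta) {
        return serialize(meta, CUR_VERSION);
    }

    // Assumed old layout: version, count, then one "key=value" string per entry.
    static byte[] writeOld(Map<String, String> meta) {
        return serialize(meta, OLD_VERSION);
    }

    private static byte[] serialize(Map<String, String> meta, byte version) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeByte(version);
            out.writeInt(meta.size());
            for (Map.Entry<String, String> e : meta.entrySet()) {
                if (version == OLD_VERSION) {
                    out.writeUTF(e.getKey() + "=" + e.getValue());
                } else {
                    out.writeUTF(e.getKey());
                    out.writeUTF(e.getValue());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads either layout, choosing the decoding path from the version byte.
    static Map<String, String> read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            byte version = in.readByte();
            if (version != OLD_VERSION && version != CUR_VERSION) {
                throw new IOException("Unsupported version: " + version);
            }
            int count = in.readInt();
            Map<String, String> meta = new LinkedHashMap<>();
            for (int i = 0; i < count; i++) {
                if (version == OLD_VERSION) {
                    String pair = in.readUTF();
                    int eq = pair.indexOf('=');
                    meta.put(pair.substring(0, eq), pair.substring(eq + 1));
                } else {
                    String key = in.readUTF();
                    meta.put(key, in.readUTF());
                }
            }
            return meta;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("_pst_", "success(1)");
        System.out.println(read(writeOld(meta)));
        System.out.println(read(write(meta)));
    }
}
```

Data is always written in the current layout; only the read path needs to know about the old one, so old records are upgraded the next time they are rewritten.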

 MapWritable is written inefficiently and confusingly
 

 Key: NUTCH-676
 URL: https://issues.apache.org/jira/browse/NUTCH-676
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 0.9.0
Reporter: Todd Lipcon
Priority: Minor
 Attachments: 
 0001-NUTCH-676-Replace-MapWritable-implementation-with-t.patch, 
 NUTCH-676_v2.patch


 The MapWritable implementation in o.a.n.crawl is written confusingly - it 
 maintains its own internal linked list, which I think may have a bug somewhere 
 (I'm getting an NPE in certain cases in the code, though it's hard to track 
 down).
 Can anyone comment as to why MapWritable is written the way it is, rather 
 than just using a HashMap or a LinkedHashMap if consistent ordering is 
 important? I imagine that would improve performance.
 What about just using the Hadoop MapWritable? Obviously that would break some 
 backwards compatibility but it may be a good idea at some point to reduce 
 confusion (I didn't realize that Nutch had its own impl until a few minutes 
 ago)




Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Piotr Kosiorowski
From what I know (from the way we use hudson), hudson only has plugins
for presenting tool results; the tools need to be executed
during the build, so the libraries need to be included and available
to ant.
Piotr

On Tue, Jan 20, 2009 at 9:40 PM, Doğacan Güney doga...@gmail.com wrote:
 On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic
 ogjunk-nu...@yahoo.com wrote:
 That I don't know...

 I don't see the jars here: 
 http://svn.apache.org/viewvc/hadoop/core/trunk/lib/

 But who knows, maybe maven/ivy fetch them on demand.  I don't know.


 Hmm, does 0.19 use ivy (0.19 also doesn't have pmd)?

 http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/

 Otis
 --
 Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



 - Original Message 
 From: Doğacan Güney doga...@gmail.com
 To: nutch-dev@lucene.apache.org
 Sent: Tuesday, January 20, 2009 1:13:20 PM
 Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

 On Tue, Jan 20, 2009 at 7:48 PM, Otis Gospodnetic
 wrote:
  Lucene doesn't use anything.
  Hadoop uses pmd integrated in Hudson.
 

 Does this mean we do not need pmd jars in nutch (are they provided by 
 hudson)?

  Otis
  --
  Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
 
 
 
  - Original Message 
  From: Doğacan Güney
  To: nutch-dev@lucene.apache.org
  Sent: Tuesday, January 20, 2009 10:49:44 AM
  Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest versions
 
  2009/1/20 Piotr Kosiorowski :
   pmd-ext contains PMD (http://pmd.sourceforge.net/) libraries. I
   committed them a long time ago in an attempt to bring some static
   analysis tools to nutch sources. There was a short discussion around
   it and we all thought it was worth doing but it never gained enough
   momentum. There is a pmd target in the build.xml file that uses them -
   they are not needed at runtime nor for standard builds.
   As nutch is built using hudson now I think it would be worth
   integrating pmd (and checkstyle/findbugs/cobertura might also be
   interesting) - hudson has very nice plugins for such tools. I am using
   it in my daily job and I have found it valuable.
 
  Thanks for the explanation. I am definitely +1 on having some sort of
  static analysis tools for nutch.
 
  Does anyone know what hadoop/hbase/lucene use for this? Or do
  they use anything at all?
 
   But as I am not an active committer now (I only try to follow the mailing
   lists) I do not think it is my call.  But if everyone is
   interested I can try to look at integration (but it will move forward
   slowly - my youngest kid was born just 2 months ago and it takes a lot
   of attention).
 
  Congratulations!
 
   Piotr
  
   On Mon, Jan 19, 2009 at 3:02 PM, Doğacan Güney (JIRA) wrote:
   Update external jars to latest versions
   ---
  
   Key: NUTCH-680
   URL: https://issues.apache.org/jira/browse/NUTCH-680
   Project: Nutch
   Issue Type: Improvement
   Reporter: Doğacan Güney
   Assignee: Doğacan Güney
   Priority: Minor
   Fix For: 1.0.0
  
  
   This issue will be used to update external libraries nutch uses.
  
   These are the libraries that are outdated (upon a quick glance):
  
   nekohtml (1.9.9)
   lucene-highlighter (2.4.0)
   jdom (1.1)
   carrot2 - as mentioned in another issue
   jets3t - above
   icu4j (4.0.1)
   jakarta-oro (2.0.8)
  
   We should probably update tika to whatever the latest is as well 
   before 1.0.
  
  
   Please add the ones I missed in the comments.
  
   Also what exactly is pmd-ext? There is an extra jakarta-oro and jaxen
   there.
  
   --
   This message is automatically generated by JIRA.
   -
   You can reply to this email to add a comment to the issue online.
  
  
  
 
 
 
  --
  Doğacan Güney
 
 



 --
 Doğacan Güney





 --
 Doğacan Güney



Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Otis Gospodnetic
They've had pmd integrated with Hudson for many months now, I believe.  I've 
seen patches in JIRA that were the result of fixes for problems reported by 
pmd.  Or maybe they run pmd by hand?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



- Original Message 
 From: Doğacan Güney doga...@gmail.com
 To: nutch-dev@lucene.apache.org
 Sent: Tuesday, January 20, 2009 3:40:20 PM
 Subject: Re: [jira] Created: (NUTCH-680) Update external jars to latest 
 versions
 
 On Tue, Jan 20, 2009 at 10:35 PM, Otis Gospodnetic
 wrote:
  That I don't know...
 
  I don't see the jars here: 
  http://svn.apache.org/viewvc/hadoop/core/trunk/lib/
 
  But who knows, maybe maven/ivy fetch them on demand.  I don't know.
 
 
 Hmm, does 0.19 use ivy (0.19 also doesn't have pmd)?
 
 http://svn.apache.org/viewvc/hadoop/core/branches/branch-0.19/lib/
 



Re: [jira] Created: (NUTCH-680) Update external jars to latest versions

2009-01-20 Thread Piotr Kosiorowski
I have configured hudson for 10 or more projects and always used the pmd
plugin only to display the pmd results - the actual pmd task that
generates the report was run from the ant script. Maybe there is a
way to run pmd reports directly in hudson (not through the project
build scripts) but I have never come across it.
Piotr

On Tue, Jan 20, 2009 at 10:39 PM, Otis Gospodnetic
ogjunk-nu...@yahoo.com wrote:
 They've had pmd integrated with Hudson for many months now, I believe.  I've 
 seen patches in JIRA that were the result of fixes for problems reported by 
 pmd.  Or maybe they run pmd by hand?
