[Nutch-dev] [jira] Closed: (NUTCH-471) Fix synchronization in NutchBean creation

2007-07-14 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-471. -- Resolution: Fixed > Fix synchronization in NutchBean creat

[Nutch-dev] [jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

2007-07-14 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512712 ] Dennis Kubes commented on NUTCH-471: Ah, sorry, my configuration was the problem. If you don't upgrad

[Nutch-dev] [jira] Reopened: (NUTCH-471) Fix synchronization in NutchBean creation

2007-07-13 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes reopened NUTCH-471: This patch breaks the search.jsp with a null pointer because the nutch bean is no longer created in

Re: [Nutch-dev] svn commit: r550669 - in /lucene/nutch/trunk/src: java/org/apache/nutch/util/ plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/ plugin/parse-html/src/java/org/apache/n

2007-06-25 Thread Dennis Kubes
Ooops, gotta remember to do that. Done. Dennis Chris Mattmann wrote: > On 6/25/07 8:34 PM, "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> wrote: > >> Author: kubes >> Date: Mon Jun 25 20:33:59 2007 >> New Revision: 550669 >> >> URL: http://svn.apache.org/viewvc?view=rev&rev=550669 >> Log: >> NUTCH-4

[Nutch-dev] [jira] Closed: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-25 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-497. -- Issue resolved and committed. > Extreme Nested Tags causes StackOverflowException in DomContentUt

[Nutch-dev] [jira] Resolved: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-25 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-497. Resolution: Fixed committed with revision 550669 > Extreme Nested Tags cau

Re: [Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-25 Thread Dennis Kubes
If no one has any objections, I will go ahead and commit this. Dennis Kubes Dennis Kubes (JIRA) wrote: > [ > https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > ] > > Dennis Kubes up

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: nested-tags-trap3.patch added nested-tags-trap3.patch with apache grant > Extreme Nes

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: nested-tags-trap2.patch added nested-tags-trap2.patch with apache grant > Extreme Nes

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: (was: nested-tags-trap3.patch) > Extreme Nested Tags causes StackOverflowException

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: (was: nested-tags-trap2.patch) > Extreme Nested Tags causes StackOverflowException

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: nested-tags-trap3.patch Adds a utility class called NodeWalker which allows a generic

[Nutch-dev] [jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-21 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506894 ] Dennis Kubes commented on NUTCH-497: I agree, I think it would be better to have something generic if we are

[Nutch-dev] [jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506725 ] Dennis Kubes commented on NUTCH-497: Doğacan, that is correct. By using the stack we shouldn't

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: nested-tags-trap2.patch Patch with the curNodeDepth removed. The patch file is nested

[Nutch-dev] [jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506596 ] Dennis Kubes commented on NUTCH-497: The newest patch is the nested-tags-trap.patch file. > Extreme Nested T

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: nested-tags-trap.patch This patch reworks DomContentUtils.getOutlinks to use a stack
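
The snippet above describes replacing recursion with a stack so that deeply nested tags cannot exhaust the call stack. A minimal sketch of that general idea follows. It is not the code from the NUTCH-497 patch: the class `IterativeLinkCollector` and its methods are hypothetical names used only for illustration, and it relies solely on the standard `org.w3c.dom` API.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

/** Hypothetical helper for illustration; not taken from the Nutch source. */
class IterativeLinkCollector {

  /** Collects href values of <a> elements using an explicit stack, no recursion. */
  static List<String> collectHrefs(Node root) {
    List<String> hrefs = new ArrayList<>();
    Deque<Node> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      Node node = stack.pop();
      if (node.getNodeType() == Node.ELEMENT_NODE
          && "a".equalsIgnoreCase(node.getNodeName())) {
        // getAttribute returns an empty string when the attribute is absent.
        String href = ((Element) node).getAttribute("href");
        if (!href.isEmpty()) {
          hrefs.add(href);
        }
      }
      // Push children in reverse so they are popped in document order.
      NodeList children = node.getChildNodes();
      for (int i = children.getLength() - 1; i >= 0; i--) {
        stack.push(children.item(i));
      }
    }
    return hrefs;
  }

  /** Parses a small well-formed markup string into a DOM document. */
  static Document parse(String xml) {
    try {
      return DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalStateException("could not parse sample markup", e);
    }
  }
}
```

Because pending nodes live on the heap-allocated deque rather than on the thread's call stack, nesting depth is bounded by available memory instead of by the JVM stack size.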

Re: [Nutch-dev] Build failed in Hudson: Nutch-Nightly #123

2007-06-20 Thread Dennis Kubes
Is this the same java 6 error that was popping up a while back? For some reason with java 6 the XML is being parsed differently in the SWF parser and therefore unit tests looking for exact strings were failing. Could this be happening in the feed parser as well? Dennis Kubes Chris Mattmann

Re: [Nutch-dev] Welcome Doğacan as Nutch committer

2007-06-12 Thread Dennis Kubes
Congratulations Doğacan, it is good to have you on board. Dennis Andrzej Bialecki wrote: > Hi all, > > I'm glad to announce that the Lucene PMC has voted to add Doğacan Güney > as Nutch committer. > > Welcome, Doğacan! There are 192 open issues in Nutch JIRA waiting to be > solved ... just di

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-06 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: ExtremeNestedTags.patch This is a rudimentary fix for those that want a workaround for

[Nutch-dev] [jira] Created: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-06 Thread Dennis Kubes (JIRA)
Issue Type: Bug Components: fetcher Affects Versions: 0.9.0, 0.8.1, 1.0.0 Environment: all Reporter: Dennis Kubes Assignee: Dennis Kubes Fix For: 1.0.0 Some webpages have a form of a spider trap that causes a

Re: [Nutch-dev] How to create patch?

2007-06-01 Thread Dennis Kubes
Take a look at this from the wiki: http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer It shows how to create a patch from SVN. To apply a patch to your source code you would use the patch command (on linux) like this: patch -p0 < your_patch_file.patch Dennis Kubes Manoharam Reddy wr

Re: [Nutch-dev] How is lib-http plugin called? It is not there in plugins.include!

2007-05-31 Thread Dennis Kubes
explicitly specify all plugins in the plugin.includes configuration variable. Dennis Kubes Manoharam Reddy wrote: > I was observing plugins.include property of my nutch-site.xml > > It has has:- > > protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-

Re: [Nutch-dev] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Dennis Kubes
the file.content.limit and ftp.content.limit options in your nutch-site.xml file. Dennis Kubes Manoharam Reddy wrote: > Time and again I get this error and as a result the segment remains > incomplete. This wastes one iteration of the for() loop in which I am > doing generate, fetch a

Re: [Nutch-dev] SIGSEGV

2007-05-07 Thread Dennis Kubes
what happens when java/nutch gets a hostname > that is obviously malformed? I believe it should throw a malformed url exception. Dennis Kubes > > -Brian > > > > > On May 6, 2007, at 11:00 AM, Andrzej Bialecki wrote: > >> Brian Whitman wrote: >>> Got thi

Re: [Nutch-dev] SIGSEGV

2007-05-05 Thread Dennis Kubes
A SIGSEGV is usually the result of hardware errors. At least that is what I have found in the past. I would run memtest on the machine to check for bad memory. Dennis Kubes Brian Whitman wrote: > Got this segfault + crash when fetching in the middle of a large fetch. > Seems to be in l

Re: [Nutch-dev] Perfomance problems and segmenting

2007-04-23 Thread Dennis Kubes
Without more information this sounds like your tomcat search nutch-site.xml file is set up to use the DFS rather than the local file system. Remember that processing jobs occurs on the DFS but for searching, indexes are best moved to the local file system. Dennis Kubes JoostRuiter wrote: >

Re: [Nutch-dev] Runing a nutch crawler on Eclipse

2007-04-12 Thread Dennis Kubes
I run the crawler through Nutch all the time. What are the specific errors that you are getting? Dennis Kubes Tanmoy Kumar Mukherjee wrote: > Hi . > I am having certain problems in running the nutch crawler on eclipse > after having followed the tutorial on Nutch wiki. It says ca

Re: [Nutch-dev] problem parsing HTML

2007-04-12 Thread Dennis Kubes
It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks() which is called from org.apache.nutch.parse.html.HtmlParser. Running some simple tests on your fragment below I get no outlink for this. What version of Nutch are you running? Dennis Kubes Ian Holsman wrote: >

Re: [Nutch-dev] Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Dennis Kubes
Andrzej Bialecki wrote: > wangxu wrote: >> Have anybody thought of replacing CrawlDb with any kind of Rational >> DB,mysql,for example? >> >> Crawldb is so difficult to manipulate. >> I often have the requirements to edit several entries in crawdb; >> But that would cost too much waiting for the

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Dennis Kubes
Yeah, I agree, I just didn't know how to proceed with the new branch structure. I will go ahead and put it into the trunk if there are no objections from anyone. Dennis Andrzej Bialecki wrote: > Dennis Kubes wrote: >> That works. I created the JIRA and attached your pat

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Dennis Kubes
That works. I created the JIRA and attached your patch. It passes all build tests and works on my 150K run across my 5 machine dev cluster. Should we go ahead and commit this? Dennis Andrzej Bialecki wrote: > Dennis Kubes wrote: >> Ok, I ran some bigger test crawls > 150K wi

[Nutch-dev] [jira] Updated: (NUTCH-467) DeleteDuplicate fails if Segment index directory has 0 documents

2007-04-04 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-467: --- Attachment: nutch-467.patch Submitted by Andrzej Bialecki. > DeleteDuplicate fails if Segment in

[Nutch-dev] [jira] Created: (NUTCH-467) DeleteDuplicate fails if Segment index directory has 0 documents

2007-04-04 Thread Dennis Kubes (JIRA)
Components: indexer Affects Versions: 0.9.0 Environment: all Reporter: Dennis Kubes Fix For: 0.9.0 If any of the segment indexes have 0 documents, then the DDRecordReader in DeleteDuplicates throws an IndexOutOfBoundsException. The record reader needs to

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Dennis Kubes
[X] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... Andrzej Bialecki wrote: > Chris Mattmann wrote: > [..] >> [ ] +1 Release the packages as Apache Nutch 0.9 >> [ ] -1 Do not release the packages because... > > +1. > > -

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-03 Thread Dennis Kubes
ng it up. My guess would be that this is a small bug within the lucene libraries when the directories have 0 results. What is everyone's opinion on this in terms of the release? My vote would be to move forward with the release. Dennis Kubes Task Id : task_0027_m_03_

Re: [Nutch-dev] svn commit: r524932 - in /lucene/nutch/trunk/src/java/org/apache/nutch/segment: SegmentMerger.java SegmentReader.java

2007-04-02 Thread Dennis Kubes
Chris, I have updated changes and resolved and closed the issue. Sorry about not getting to it sooner. Dennis Kubes Chris Mattmann wrote: > Hi Dennis, > > Thanks for taking care of this. :-) Could you update CHANGES.txt as well? > Once you take care of that, in about 2 hrs (whe

[Nutch-dev] [jira] Closed: (NUTCH-333) SegmentMerger and SegmentReader should use NutchJob

2007-04-02 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-333. -- > SegmentMerger and SegmentReader should use Nutch

[Nutch-dev] [jira] Resolved: (NUTCH-333) SegmentMerger and SegmentReader should use NutchJob

2007-04-02 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-333. Resolution: Fixed Issue resolved > SegmentMerger and SegmentReader should use Nutch

[Nutch-dev] [jira] Updated: (NUTCH-333) SegmentMerger and SegmentReader should use NutchJob

2007-04-02 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-333: --- Attachment: use-nutch-job_patch.txt updated patch, submitted by Doğacan Güney > SegmentMerger

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-02 Thread Dennis Kubes
ime, Los Angeles, PST) > on removing the tag, and starting the process over again. > > In the meanwhile, Dennis, do you have the patch that fixes the issue with > Hadoop? If so, ,could you commit it ASAP to the trunk. Once that's done, > I'll remove the tag, and star th

Re: [Nutch-dev] Next release - 0.10.0 or 1.0.0 ?

2007-03-28 Thread Dennis Kubes
+1 Andrzej Bialecki wrote: > Hi all, > > I know it's a trivial issue, but still ... When this release is out, I > propose that we should name the next release 1.0.0, and not 0.10.0. The > effect is purely psychological, but it also reflects our confidence in > the platform. I think that a 1.

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-28 Thread Dennis Kubes
again (large scale) > 5. if all goes well, finish release process > 6. tag tags/release-0.9 I agree with this process. > > Thoughts? > > Thanks! > > Cheers, > Chris > > > On 3/28/07 10:35 AM, "Andrzej Bialecki" <[EMAIL PROTECTED]> wrote:

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-28 Thread Dennis Kubes
Yes. This seems to have fixed the problem. All, do we want to create a JIRA and commit this for the 0.9 release? Dennis Andrzej Bialecki wrote: > Doğacan Güney wrote: >> Hi, >> >> On 3/28/07, Dennis Kubes <[EMAIL PROTECTED]> wrote: >>> >>> This is

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-27 Thread Dennis Kubes
class loading. Dennis Kubes Dennis Kubes wrote: > I spoke too soon. Below is the output of errors on mergesegs. This > looks more like a Hadoop issue to me, but I will need to dig into it. It > also may be something that I am doing on my end. This was a merge of > three differe

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-27 Thread Dennis Kubes
ahead. Dennis Kubes java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.metadata.MetaWrapper at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:344) at org.apache.hadoop.mapred.JobConf.getOutputValue

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-27 Thread Dennis Kubes
[X] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... I have been running some bigger crawls with the release this morning. Everything looks good. Dennis Kubes Chris Mattmann wrote: > Hi Folks, > > I have posted a candidate for the Apache

Re: [Nutch-dev] Initiation of 0.9 release process

2007-03-26 Thread Dennis Kubes
Let me know if I can help in any way. Dennis Kubes Chris Mattmann wrote: > Hi Folks, > > As your friendly neighborhood 0.9 release manager, I just wanted to give > you all a heads up that I'd like to begin the release process today. If I > hear no objections by 00:00:00 UT

Re: [Nutch-dev] Problem with modifying Plugin

2007-03-26 Thread Dennis Kubes
You would need to set up your logging configuration to include INFO in the log4j.properties file in the conf directory. Dennis Kubes z0mbi3 wrote: > Hi, > I m new to nutch. I have been trying to understand the working of the opic > scoring plugin but have certain issues: > > I

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-25 Thread Dennis Kubes
I worked through this swf issue a little more and it seems that java 6 parses out the content differently than java 5. My guess is that it is some type of collection change from 5 to 6 because it looks like only the ordering of the elements is different. Dennis Kubes Sample Help

Re: [Nutch-dev] FW: [jira] Created: (HADOOP-1147) remove all @author tags from source

2007-03-23 Thread Dennis Kubes
It shouldn't be too much trouble to attack this with the logging changes. Dennis Kubes Chris Mattmann wrote: > Hey Doug, > > Do you think we should do this in Nutch too? I'm in favor of doing this -- > what does everyone else feel? > > Th

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-21 Thread Dennis Kubes
I did an update, clean, and test and got no errors. BUILD SUCCESSFUL Total time: 6 minutes Sami Siren wrote: > 2007/3/21, Andrzej Bialecki <[EMAIL PROTECTED]>: >> >> Sami Siren wrote: >> > for me it works: >> > >> > ... >> > BUILD SUCCESSFUL >> > Total time: 4 minutes 3 seconds >> >> I did a fresh

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-20 Thread Dennis Kubes
I am good to go as well. Dennis Kubes Andrzej Bialecki wrote: > Sami Siren wrote: >> Andrzej Bialecki wrote: >>> Hi all, >>> >>> I just committed Hadoop 0.12.1. Let's double-check that it works ok. >>> Here's the list of Critical/Blocker

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-15 Thread Dennis Kubes
Andrzej Bialecki wrote: > Dennis Kubes wrote: >>> >>> Could you perhaps create a JIRA issue and attach the patches from the >>> current trunk/ to your 0.12.1-based version? As soon as 0.12.1 is out >>> the door we can upgrade, and then finally wrap

[Nutch-dev] [jira] Created: (NUTCH-459) Upgrade Nutch to Hadoop 0.12.1

2007-03-15 Thread Dennis Kubes (JIRA)
Reporter: Dennis Kubes Assigned To: Dennis Kubes Fix For: 0.9.0 Attachments: hadoop-0.12.1-dev-core.jar This JIRA contains the new hadoop-0.12.1-dev-core.jar as of revision 518636. As far as I can tell this jar doesn't break any of the current Nutch trunk

[Nutch-dev] [jira] Updated: (NUTCH-459) Upgrade Nutch to Hadoop 0.12.1

2007-03-15 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-459?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-459: --- Attachment: hadoop-0.12.1-dev-core.jar hadoop-0.12.1-dev-core.jar as of revision 518636 > Upgr

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-14 Thread Dennis Kubes
Andrzej Bialecki wrote: > Dennis Kubes wrote: >> The crawl for 1M pages completed successfully. There was an issue >> with doing a copyToLocal but that has already been filed as a HADOOP >> bug and the patch will be included in 0.12.x >> > > > That&#

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-14 Thread Dennis Kubes
s that take a lot of RAM so I have the childopts set to 1024M. For standard fetching I don't know how much difference it would make. Dennis Kubes > > Thanks > Marc > > On 3/14/07, Dennis Kubes <[EMAIL PROTECTED]> wrote: >> >> >> >> Marc Bouch

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-14 Thread Dennis Kubes
ll case. I don't have any benchmarks as of yet but I will keep the list informed of our progress. Dennis Kubes > > Thanks > Marc Boucher, aTerra

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-14 Thread Dennis Kubes
: 23022 min score: 0.0090 avg score: 0.173 max score: 2119.167 status 1 (db_unfetched):9899275 status 2 (db_fetched): 667354 status 3 (db_gone): 11195 status 4 (db_redir_temp): 219507 status 5 (db_redir_perm): 41839 Dennis Kubes Andrzej Bialecki

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-12 Thread Dennis Kubes
Andrzej Bialecki wrote: > Dennis Kubes wrote: >> I agree there may be subtle bugs. >> >> I can do say a full dmoz crawl (~5M pages) with nutch trunk and hadoop >> 12.1 on a small cluster of 5 machines if this would help? We have >> already >> > &

Re: [Nutch-dev] 0.9 release

2007-03-11 Thread Dennis Kubes
so) then wrote it all to disk (in the Yes. The hadoop team implemented an in-memory buffer and spill-to-disk functionality. I believe the amount stored in memory before spills is configurable. Dennis Kubes > Hadoop temp directory) at once. During the write operation, which lasted > no

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-11 Thread Dennis Kubes
lp? We have already done some crawls > 100K urls with 11.2 without problems. I say let's test it and if there aren't any significant issues then let's go with 12.1 if the hadoop team thinks it will be more stable. One question though, are there any concerns about upgrading clu

Re: [Nutch-dev] svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt

2007-03-10 Thread Dennis Kubes
the unreleased changes section). Could you please append your changes to the > end of the file, and recommit? > > Thanks a lot! > > Cheers, > Chris Sorry about that. I saw the warning message thinking it was a version break. Everything should be fixed now. Dennis Kube

[Nutch-dev] [jira] Closed: (NUTCH-233) wrong regular expression hang reduce process for ever

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-233. -- Issue closed > wrong regular expression hang reduce process for e

[Nutch-dev] [jira] Resolved: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-436. Resolution: Fixed Patch tested on 10,000 URL run with no apparent issues. Reviewed and committed

[Nutch-dev] [jira] Closed: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-436. -- Issue closed. > Incorrect handling of relative paths when the embedded URL path is em

[Nutch-dev] [jira] Resolved: (NUTCH-233) wrong regular expression hang reduce process for ever

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-233. Resolution: Fixed The new regex has been added to both the regex-urlfilter.txt and the crawl

Re: [Nutch-dev] How to read data from segments

2007-03-09 Thread Dennis Kubes
Steve Severance wrote: >> -Original Message- >> From: Dennis Kubes [mailto:[EMAIL PROTECTED] >> Sent: Friday, March 09, 2007 9:47 AM >> To: nutch-dev@lucene.apache.org >> Subject: Re: How to read data from segments >> >> >> >> Steve S

Re: [Nutch-dev] 0.9 release

2007-03-09 Thread Dennis Kubes
Dennis Kubes wrote: >> Dennis Kubes wrote: >>> I was looking through the JIRA to try and help create a list for this >>> release and to say the least it is a little overwhelming. It looks >>> like there are 183 issues total with 152 being unassigned. What has

Re: [Nutch-dev] How to read data from segments

2007-03-09 Thread Dennis Kubes
this year the lack of detailed > information for new developers was cited as a barrier to more involvement. I > would be happy to contribute this back to the wiki if there is interest. Absolutely. The more documentation we have, especially for new develope

[Nutch-dev] [jira] Created: (NUTCH-457) Create top level dist directory and checkin KEYS file to subversion be standard with Lucene Java and Hadoop

2007-03-08 Thread Dennis Kubes (JIRA)
/jira/browse/NUTCH-457 Project: Nutch Issue Type: Task Environment: N/A Reporter: Dennis Kubes Assigned To: Dennis Kubes Priority: Minor The KEYS file contains public keys of committers and is used to sign releases. According to a

Re: [Nutch-dev] 0.9 release

2007-03-07 Thread Dennis Kubes
> Dennis Kubes wrote: >> I was looking through the JIRA to try and help create a list for this >> release and to say the least it is a little overwhelming. It looks >> like there are 183 issues total with 152 being unassigned. What has >> been the current process fo

Re: [Nutch-dev] 0.9 release

2007-03-07 Thread Dennis Kubes
re of, as soon as you all give me > the green light. Good by me. > > So, please, committer-brethren, let me know what you think about 1-3, as it > would help me understand how to move forward. > > Thanks! > > Cheers, > Chris > > Dennis Kubes

Re: [Nutch-dev] FW: Nutch release process help

2007-03-06 Thread Dennis Kubes
g when I said that I would email Piotr. It's too bad that > this has turned out to be an issue that I've handled incorrectly, and for > that, I apologize. I will do my best to thoroughly vet all such discussions > on the nutch list in the future. No issues with me. Dennis Kub

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-05 Thread Dennis Kubes
Chris Mattmann wrote: > Hi Guys, > >> Blocker >> >> * NUTCH-400 (Update & add missing license headers) - I believe this is >> fixed and should be closed > > +1, thanks to Sami for closing it. > >> * NUTCH-353 (pages that serverside forwards will be refetched every >> time) - this was

Re: [Nutch-dev] java.io.FileNotFoundException: / (Is a directory)

2007-03-05 Thread Dennis Kubes
That is a problem with the hadoop.log.dir value not being set. It is trying to use the DRFA appender to a file and can't find the log directory. Dennis Gal Nitzan wrote: > > Just installed latest from trunk. > > I run mergesegs and I get the following error in all tasks log files (I use > default log4

Re: [Nutch-dev] Welcome Dennis Kubes as Nutch committer

2007-03-04 Thread Dennis Kubes
OK. I finally figured out how to republish the site. Only took me 3 days. Feeling hazed now! :) Dennis Kubes Sami Siren wrote: > Welcome on board Dennis! > > -- > Sami Siren > > Dennis Kubes wrote: >> Hi All, >> >> Thank you Andrzej for your kind wo

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-04 Thread Dennis Kubes
NUTCH-436 has a patch now if we want to add that to this release. Dennis Kubes Andrzej Bialecki wrote: > Sean Dean wrote: >> As for which Hadoop version is included in the next Nutch release, I >> share the same concern as Sami with 0.10.1 as it NPE's on anything >>

[Nutch-dev] [jira] Updated: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

2007-03-04 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-436: --- Attachment: NUTCH-436-20070304.patch NUTCH-436-20070304.patch handles correct encoding of the params

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-03 Thread Dennis Kubes
ersion of Nutch to be soon followed by a minor stable release ... +1 for using 0.11.2. I looked through the release notes for 0.12 and there were some niceties such as HADOOP-432 for undeletes and a lot of bug fixes, but it didn't look like there were any critical issues as far as Nutch is con

[Nutch-dev] [jira] Assigned: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

2007-03-03 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes reassigned NUTCH-436: -- Assignee: Dennis Kubes > Incorrect handling of relative paths when the embedded URL path

[Nutch-dev] [jira] Created: (NUTCH-454) Review Debug Level Log Guards

2007-03-03 Thread Dennis Kubes (JIRA)
: Dennis Kubes Assigned To: Chris A. Mattmann Fix For: 0.9.0, 0.8.1 There are currently log guards (i.e. is*Enabled type code) in many different places in the code. NUTCH-309 is related to removing those log guards. The caveat is that debug level log guards should be
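
For readers unfamiliar with the term, a log guard is a level check wrapped around a logging call so that an expensive message is only built when it will actually be emitted. The sketch below illustrates the pattern with `java.util.logging` to stay self-contained; `isLoggable(Level.FINE)` plays the role of the `is*Enabled` check mentioned above. The class `LogGuardDemo` and its methods are hypothetical, and the logging API Nutch itself uses may differ.

```java
import java.util.Arrays;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical demo class; shows the guard pattern only. */
class LogGuardDemo {
  static final Logger LOG = Logger.getLogger(LogGuardDemo.class.getName());

  /** Counts how often the costly message text was actually built. */
  static int describeCalls = 0;

  /** Stands in for an expensive string conversion. */
  static String describe(int[] values) {
    describeCalls++;
    return Arrays.toString(values);
  }

  static void process(int[] values) {
    // Guard: skip building the message unless debug-level output is enabled.
    if (LOG.isLoggable(Level.FINE)) {
      LOG.fine("processing " + describe(values));
    }
  }
}
```

The guard matters mainly at debug level, where messages tend to be verbose and the level is usually disabled in production; for cheap constant messages the check adds little.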

Re: [Nutch-dev] Welcome Dennis Kubes as Nutch committer

2007-02-28 Thread Dennis Kubes
. I am 28 and have been programming for about 12 years. So as first commit I need to add my name and re-publish the website. Let the hazing begin. Dennis Kubes Andrzej Bialecki wrote: > Hi all, > > Some time ago I proposed to Lucene PMC that Dennis should become a Nutch > committer

Re: [Nutch-dev] log guards

2007-02-28 Thread Dennis Kubes
I can also work on this, Chris do you want me to do it or do you want to coordinate our efforts? Dennis Kubes Jérôme Charron wrote: > Hi Chris, > > The JIRA issue is the 309 : https://issues.apache.org/jira/browse/NUTCH-309 > Thanks for your help. > > Jérôme > > On

[Nutch-dev] [jira] Updated: (NUTCH-448) Allow Plugin Includes and Excludes from File

2007-02-21 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-448: --- Attachment: plugin-fromfile.patch The plugin-fromfile.patch file contains the functionality for

[Nutch-dev] [jira] Created: (NUTCH-448) Allow Plugin Includes and Excludes from File

2007-02-21 Thread Dennis Kubes (JIRA)
Environment: all platforms Reporter: Dennis Kubes Assigned To: Dennis Kubes Priority: Minor Fix For: 0.9.0 This functionality allows the plugin.includes and plugin.excludes values to be moved out of the nutch-default.xml and nutch-site.xml files and loaded

[Nutch-dev] [jira] Commented: (NUTCH-447) Dmoz Structure Parser Tool

2007-02-21 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474713 ] Dennis Kubes commented on NUTCH-447: This tool is for people who need a defined category structure or want to

[Nutch-dev] [jira] Updated: (NUTCH-447) Dmoz Structure Parser Tool

2007-02-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-447: --- Attachment: dmoz-structure.patch Patch that contains the DmozStructureParser class. > Dmoz Struct

[Nutch-dev] [jira] Created: (NUTCH-447) Dmoz Structure Parser Tool

2007-02-20 Thread Dennis Kubes (JIRA)
Reporter: Dennis Kubes Assigned To: Dennis Kubes Priority: Minor This is a tool that will take the dmoz structure RDF file and return a listing of the categories. The categories returned can be limited by depth or by regular expression pattern. This tool borrows heavily from

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-19 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474355 ] Dennis Kubes commented on NUTCH-247: We could move the code to a utility class but if we want it to be called

[Nutch-dev] [jira] Updated: (NUTCH-247) robot parser to restrict.

2007-02-19 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-247: --- Attachment: agent-names3.patch.txt This patch logs and throws an exception if the agent name is not

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-18 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474068 ] Dennis Kubes commented on NUTCH-247: I agree, but then should we approach the check as a configurable option

[Nutch-dev] [jira] Updated: (NUTCH-247) robot parser to restrict.

2007-02-17 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-247: --- Attachment: agent-names.patch This patch removes the checks and severe logging from the

[Nutch-dev] [jira] Assigned: (NUTCH-247) robot parser to restrict.

2007-02-17 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes reassigned NUTCH-247: -- Assignee: Dennis Kubes > robot parser to restr

Re: [Nutch-dev] Injector checking for other than STATUS_INJECTED

2007-02-15 Thread Dennis Kubes
Ahhh... Now I get it :) Andrzej Bialecki wrote: > Dennis Kubes wrote: >> Sorry. I am still not getting this. I understand the reason but I am >> not seeing how it works. > > Ah, because apparently it doesn't ... :( You were right, the first job > consists on

Re: [Nutch-dev] Injector checking for other than STATUS_INJECTED

2007-02-15 Thread Dennis Kubes
come from in the Reducer. Dennis Kubes Andrzej Bialecki wrote: > Gal Nitzan wrote: >> Hi Andrzej, >> >> Does it mean that when you inject an existing (in crawldb) a URL it >> changes >> its status to STATUS_DB_UNFETCHED? >> >> > > With the c

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-14 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473295 ] Dennis Kubes commented on NUTCH-247: I think the idea here is to NOT allow people to run fetchers for which they

Re: [Nutch-dev] NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

2007-02-14 Thread Dennis Kubes
pdfbox software to parse PDF files so you may want to take the specific file and see if it parses correctly outside of nutch using pdfbox. Dennis Kubes Armel T. Nene wrote: > Dennis > > I was wondering if this patch could fix my problem which is, if not the > same, very similar to

Re: [Nutch-dev] NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

2007-02-13 Thread Dennis Kubes
Actually I take it back. I don't think it is the same problem but I do think it is the right solution. Dennis Kubes Dennis Kubes wrote: > This has to do with HADOOP-964. Replace the jar files in your Nutch > versions with the most recent versions from Hadoop. You will also need
