[Nutch-dev] [jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

2007-07-14 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512712 ] Dennis Kubes commented on NUTCH-471: Ah, sorry, my configuration was the problem. If you don't upgrade

Re: [Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-25 Thread Dennis Kubes
If no one has any objections, I will go ahead and commit this. Dennis Kubes Dennis Kubes (JIRA) wrote: [ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497

[Nutch-dev] [jira] Closed: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-25 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-497. -- Issue resolved and committed. Extreme Nested Tags causes StackOverflowException in DomContentUtils

[Nutch-dev] [jira] Resolved: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-25 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-497. Resolution: Fixed. Committed with revision 550669. Extreme Nested Tags causes StackOverflowException

Re: [Nutch-dev] svn commit: r550669 - in /lucene/nutch/trunk/src: java/org/apache/nutch/util/ plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang/ plugin/parse-html/src/java/org/apache/n

2007-06-25 Thread Dennis Kubes
Oops, gotta remember to do that. Done. Dennis Chris Mattmann wrote: On 6/25/07 8:34 PM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Author: kubes Date: Mon Jun 25 20:33:59 2007 New Revision: 550669 URL: http://svn.apache.org/viewvc?view=rev&rev=550669 Log: NUTCH-497: Fixes problems

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: (was: nested-tags-trap2.patch) Extreme Nested Tags causes StackOverflowException

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: (was: nested-tags-trap3.patch) Extreme Nested Tags causes StackOverflowException

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: nested-tags-trap2.patch added nested-tags-trap2.patch with apache grant Extreme Nested

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: nested-tags-trap3.patch added nested-tags-trap3.patch with apache grant Extreme Nested

[Nutch-dev] [jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-21 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506894 ] Dennis Kubes commented on NUTCH-497: I agree, I think it would be better to have something generic if we

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: nested-tags-trap.patch This patch reworks DomContentUtils.getOutlinks to use a stack
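The idea behind the patch above (walking the DOM with an explicit stack rather than recursion) can be sketched roughly as follows. This is not the committed NUTCH-497 code; the class name IterativeLinkWalker and both method names are made up for illustration, and only standard JDK DOM calls are used. Because the pending nodes live on a heap-allocated deque, nesting depth no longer consumes call-stack frames.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical illustration; not Nutch's DOMContentUtils.
final class IterativeLinkWalker {

  // Collects href values of <a> elements in document order without recursion.
  static List<String> collectHrefs(Node root) {
    List<String> hrefs = new ArrayList<>();
    Deque<Node> stack = new ArrayDeque<>();
    stack.push(root);
    while (!stack.isEmpty()) {
      Node node = stack.pop();
      if (node.getNodeType() == Node.ELEMENT_NODE
          && "a".equalsIgnoreCase(node.getNodeName())) {
        // getAttribute returns an empty string when the attribute is absent.
        String href = ((Element) node).getAttribute("href");
        if (!href.isEmpty()) {
          hrefs.add(href);
        }
      }
      // Push children last-to-first so the first child is popped first.
      for (Node c = node.getLastChild(); c != null; c = c.getPreviousSibling()) {
        stack.push(c);
      }
    }
    return hrefs;
  }

  // Builds a test document: <html> wrapping `depth` nested <div>s and one <a>.
  static Document deeplyNested(int depth, String href) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
    Element current = doc.createElement("html");
    doc.appendChild(current);
    for (int i = 0; i < depth; i++) {
      Element div = doc.createElement("div");
      current.appendChild(div);
      current = div;
    }
    Element anchor = doc.createElement("a");
    anchor.setAttribute("href", href);
    current.appendChild(anchor);
    return doc;
  }
}
```

Calling IterativeLinkWalker.collectHrefs(IterativeLinkWalker.deeplyNested(2000, "http://example.com/")) visits every node with a constant number of Java stack frames, which is the property the spider-trap pages in this issue break for a recursive walker.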

[Nutch-dev] [jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506596 ] Dennis Kubes commented on NUTCH-497: The newest patch is the nested-tags-trap.patch file. Extreme Nested Tags

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: nested-tags-trap2.patch Patch with the curNodeDepth removed. The patch file is nested

[Nutch-dev] [jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506725 ] Dennis Kubes commented on NUTCH-497: Doğacan, that is correct. By using the stack we shouldn't get

Re: [Nutch-dev] Welcome Doğacan as Nutch committer

2007-06-12 Thread Dennis Kubes
Congratulations Doğacan, it is good to have you on board. Dennis Andrzej Bialecki wrote: Hi all, I'm glad to announce that the Lucene PMC has voted to add Doğacan Güney as Nutch committer. Welcome, Doğacan! There are 192 open issues in Nutch JIRA waiting to be solved ... just dive in!

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-06 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: --- Attachment: ExtremeNestedTags.patch This is a rudimentary fix for those that want a workaround

Re: [Nutch-dev] How to create patch?

2007-06-01 Thread Dennis Kubes
Take a look at this from the wiki: http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer It shows how to create a patch from SVN. To apply a patch to your source code you would use the patch command (on Linux) like this: patch -p0 < your_patch_file.patch Dennis Kubes Manoharam Reddy wrote

Re: [Nutch-dev] How is lib-http plugin called? It is not there in plugins.include!

2007-05-31 Thread Dennis Kubes
to explicitly specify all plugins in the plugin.includes configuration variable. Dennis Kubes Manoharam Reddy wrote: I was observing plugins.include property of my nutch-site.xml It has: <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url
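For readers following the question, here is a small sketch of how a plugin.includes value behaves when treated as a regular expression over plugin IDs. It assumes full-match semantics, and it uses only the portion of the quoted value that is complete in the message above (the rest is cut off). The class PluginIncludeSketch is hypothetical and is not Nutch's plugin loader.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not Nutch's plugin loading code.
final class PluginIncludeSketch {

  // Returns true when the whole plugin ID matches the includes expression.
  static boolean isIncluded(String pluginIncludes, String pluginId) {
    return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
  }
}
```

With the shortened expression protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic, the ID parse-html is included while lib-http is not, which is the observation that prompted the question.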

Re: [Nutch-dev] OutOfMemoryError - Why should the while(1) loop stop?

2007-05-30 Thread Dennis Kubes
the file.content.limit and ftp.content.limit options in your nutch-site.xml file. Dennis Kubes Manoharam Reddy wrote: Time and again I get this error and as a result the segment remains incomplete. This wastes one iteration of the for() loop in which I am doing generate, fetch and update

Re: [Nutch-dev] SIGSEGV

2007-05-07 Thread Dennis Kubes
believe it should throw a malformed url exception. Dennis Kubes -Brian On May 6, 2007, at 11:00 AM, Andrzej Bialecki wrote: Brian Whitman wrote: Got this segfault + crash when fetching in the middle of a large fetch. Seems to be in looking up a hostname? Is this by any chance

Re: [Nutch-dev] SIGSEGV

2007-05-05 Thread Dennis Kubes
SIGSEGV is usually the result of hardware errors. At least that is what I have found in the past. I would run memtest on the machine to check for bad memory. Dennis Kubes Brian Whitman wrote: Got this segfault + crash when fetching in the middle of a large fetch. Seems to be in looking up

Re: [Nutch-dev] Perfomance problems and segmenting

2007-04-23 Thread Dennis Kubes
Without more information, this sounds like the nutch-site.xml file for your Tomcat search webapp is set up to use the DFS rather than the local file system. Remember that processing jobs occurs on the DFS but for searching, indexes are best moved to the local file system. Dennis Kubes JoostRuiter wrote: Hi

Re: [Nutch-dev] Have anybody thought of replacing CrawlDb with any kind of Rational DB?

2007-04-12 Thread Dennis Kubes
Andrzej Bialecki wrote: wangxu wrote: Have anybody thought of replacing CrawlDb with any kind of Rational DB,mysql,for example? Crawldb is so difficult to manipulate. I often have the requirements to edit several entries in crawdb; But that would cost too much waiting for the mapReduce.

Re: [Nutch-dev] problem parsing HTML

2007-04-12 Thread Dennis Kubes
It happens in org.apache.nutch.parse.html.DOMContentUtils.getOutlinks() which is called from org.apache.nutch.parse.html.HtmlParser. Running some simple tests on your fragment below I get no outlink for this. What version of Nutch are you running? Dennis Kubes Ian Holsman wrote: Hi. I'm

Re: [Nutch-dev] Runing a nutch crawler on Eclipse

2007-04-12 Thread Dennis Kubes
I run the crawler through Nutch all the time. What are the specific errors that you are getting? Dennis Kubes Tanmoy Kumar Mukherjee wrote: Hi. I am having certain problems in running the nutch crawler on eclipse after having followed the tutorial on Nutch wiki. It says cannot build

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Dennis Kubes
[X] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... Andrzej Bialecki wrote: Chris Mattmann wrote: [..] [ ] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... +1.

[Nutch-dev] [jira] Created: (NUTCH-467) DeleteDuplicate fails if Segment index directory has 0 documents

2007-04-04 Thread Dennis Kubes (JIRA)
Components: indexer Affects Versions: 0.9.0 Environment: all Reporter: Dennis Kubes Fix For: 0.9.0 If any of the segment indexes have 0 documents, then the DDRecordReader in DeleteDuplicates throws an IndexOutOfBoundsException. The record reader needs

[Nutch-dev] [jira] Updated: (NUTCH-467) DeleteDuplicate fails if Segment index directory has 0 documents

2007-04-04 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-467: --- Attachment: nutch-467.patch Submitted by Andrzej Bialecki. DeleteDuplicate fails if Segment index

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Dennis Kubes
That works. I created the JIRA and attached your patch. It passes all build tests and works on my 150K run across my 5 machine dev cluster. Should we go ahead and commit this? Dennis Andrzej Bialecki wrote: Dennis Kubes wrote: Ok, I ran some bigger test crawls 150K with the 0.9RC

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-04 Thread Dennis Kubes
Yeah, I agree, I just didn't know how to proceed with the new branch structure. I will go ahead and put it into the trunk if there are no objections from anyone. Dennis Andrzej Bialecki wrote: Dennis Kubes wrote: That works. I created the JIRA and attached your patch. It passes all

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-03 Thread Dennis Kubes
guess would be that this is a small bug within the lucene libraries when the directories have 0 results. What is everyone's opinion on this in terms of the release? My vote would be to move forward with the release. Dennis Kubes Task Id : task_0027_m_03_3, Status : FAILED task_0027_m_03_3

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-04-02 Thread Dennis Kubes
? If so, could you commit it ASAP to the trunk. Once that's done, I'll remove the tag, and start the release process over again, and get an RC out for a vote. Then, we can move forward from there. I will do this immediately. Dennis Kubes Thanks, guys! Cheers, Chris I still propose

[Nutch-dev] [jira] Updated: (NUTCH-333) SegmentMerger and SegmentReader should use NutchJob

2007-04-02 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-333: --- Attachment: use-nutch-job_patch.txt updated patch, submitted by Doğacan Güney SegmentMerger

[Nutch-dev] [jira] Resolved: (NUTCH-333) SegmentMerger and SegmentReader should use NutchJob

2007-04-02 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-333. Resolution: Fixed Issue resolved SegmentMerger and SegmentReader should use NutchJob

[Nutch-dev] [jira] Closed: (NUTCH-333) SegmentMerger and SegmentReader should use NutchJob

2007-04-02 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-333. -- SegmentMerger and SegmentReader should use NutchJob

Re: [Nutch-dev] svn commit: r524932 - in /lucene/nutch/trunk/src/java/org/apache/nutch/segment: SegmentMerger.java SegmentReader.java

2007-04-02 Thread Dennis Kubes
Chris, I have updated changes and resolved and closed the issue. Sorry about not getting to it sooner. Dennis Kubes Chris Mattmann wrote: Hi Dennis, Thanks for taking care of this. :-) Could you update CHANGES.txt as well? Once you take care of that, in about 2 hrs (when I get home

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-27 Thread Dennis Kubes
[X] +1 Release the packages as Apache Nutch 0.9 [ ] -1 Do not release the packages because... I have been running some bigger crawls with the release this morning. Everything looks good. Dennis Kubes Chris Mattmann wrote: Hi Folks, I have posted a candidate for the Apache Nutch 0.9 release

Re: [Nutch-dev] [VOTE] Release Apache Nutch 0.9

2007-03-27 Thread Dennis Kubes
. Dennis Kubes java.lang.RuntimeException: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.metadata.MetaWrapper at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:344) at org.apache.hadoop.mapred.JobConf.getOutputValueClass

Re: [Nutch-dev] FW: [jira] Created: (HADOOP-1147) remove all @author tags from source

2007-03-23 Thread Dennis Kubes
It shouldn't be too much trouble to attack this with the logging changes. Dennis Kubes Chris Mattmann wrote: Hey Doug, Do you think we should do this in Nutch too? I'm in favor of doing this -- what does everyone else feel? Thanks! Cheers, Chris

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-20 Thread Dennis Kubes
I am good to go as well. Dennis Kubes Andrzej Bialecki wrote: Sami Siren wrote: Andrzej Bialecki wrote: Hi all, I just committed Hadoop 0.12.1. Let's double-check that it works ok. Here's the list of Critical/Blocker issues I mentioned before, and their current status: Any other stuff

[Nutch-dev] [jira] Created: (NUTCH-459) Upgrade Nutch to Hadoop 0.12.1

2007-03-15 Thread Dennis Kubes (JIRA)
Reporter: Dennis Kubes Assigned To: Dennis Kubes Fix For: 0.9.0 Attachments: hadoop-0.12.1-dev-core.jar This JIRA contains the new hadoop-0.12.1-dev-core.jar as of revision 518636. As far as I can tell this jar doesn't break any of the current Nutch trunk code

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-15 Thread Dennis Kubes
Andrzej Bialecki wrote: Dennis Kubes wrote: Could you perhaps create a JIRA issue and attach the patches from the current trunk/ to your 0.12.1-based version? As soon as 0.12.1 is out the door we can upgrade, and then finally wrap up our release. Do you want me to create a JIRA issue

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-14 Thread Dennis Kubes
: 23022 min score: 0.0090 avg score: 0.173 max score: 2119.167 status 1 (db_unfetched):9899275 status 2 (db_fetched): 667354 status 3 (db_gone): 11195 status 4 (db_redir_temp): 219507 status 5 (db_redir_perm): 41839 Dennis Kubes Andrzej Bialecki

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-14 Thread Dennis Kubes
as of yet but I will keep the list informed of our progress. Dennis Kubes Thanks Marc Boucher, aTerra

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-14 Thread Dennis Kubes
have the childopts set to 1024M. For standard fetching I don't know how much difference it would make. Dennis Kubes Thanks Marc On 3/14/07, Dennis Kubes [EMAIL PROTECTED] wrote: Marc Boucher wrote: Dennis, I'm curious what kind of hardware your 5 system cluster uses? CPU, RAM

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-14 Thread Dennis Kubes
Andrzej Bialecki wrote: Dennis Kubes wrote: The crawl for 1M pages completed successfully. There was an issue with doing a copyToLocal but that has already been filed as a HADOOP bug and the patch will be included in 0.12.x That's very good news, Dennis - thanks for taking the time

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-12 Thread Dennis Kubes
Andrzej Bialecki wrote: Dennis Kubes wrote: I agree there may be subtle bugs. I can do say a full dmoz crawl (~5M pages) with nutch trunk and hadoop 12.1 on a small cluster of 5 machines if this would help? We have already Certainly, that would be most welcome. I will start

Re: [Nutch-dev] Hadoop 0.11.2 vs. 0.12.1

2007-03-11 Thread Dennis Kubes
with 11.2 without problems. I say let's test it and if there aren't any significant issues then let's go with 12.1 if the hadoop team thinks it will be more stable. One question though, are there any concerns about upgrading clusters as opposed to new fetches? Dennis Kubes -- Best regards

Re: [Nutch-dev] 0.9 release

2007-03-11 Thread Dennis Kubes
to disk (in the Yes. The hadoop team implemented an in-memory buffer and spill-to-disk functionality. I believe the amount stored in memory before spills is configurable. Dennis Kubes Hadoop temp directory) at once. During the write operation, which lasted no more than 8 seconds each time

Re: [Nutch-dev] svn commit: r516759 - /lucene/nutch/trunk/CHANGES.txt

2007-03-10 Thread Dennis Kubes
section). Could you please append your changes to the end of the file, and recommit? Thanks a lot! Cheers, Chris Sorry about that. I saw the warning message thinking it was a version break. Everything should be fixed now. Dennis Kubes On 3/10/07 10:03 AM, [EMAIL PROTECTED] [EMAIL

Re: [Nutch-dev] How to read data from segments

2007-03-09 Thread Dennis Kubes
. The more documentation we have, especially for new developers, the better. If you need any questions answered in doing this, give me a shout and I will help as much as I can. Dennis Kubes Regards, Steve

Re: [Nutch-dev] 0.9 release

2007-03-09 Thread Dennis Kubes
Dennis Kubes wrote: Dennis Kubes wrote: I was looking through the JIRA to try and help create a list for this release and to say the least it is a little overwhelming. It looks like there are 183 issues total with 152 being unassigned. What has been the current process for testing

Re: [Nutch-dev] How to read data from segments

2007-03-09 Thread Dennis Kubes
Steve Severance wrote: -Original Message- From: Dennis Kubes [mailto:[EMAIL PROTECTED] Sent: Friday, March 09, 2007 9:47 AM To: nutch-dev@lucene.apache.org Subject: Re: How to read data from segments Steve Severance wrote: I am trying to learn the internals of Nutch

[Nutch-dev] [jira] Resolved: (NUTCH-233) wrong regular expression hang reduce process for ever

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-233. Resolution: Fixed The new regex has been added to both the regex-urlfilter.txt and the crawl

[Nutch-dev] [jira] Closed: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-436. -- Issue closed. Incorrect handling of relative paths when the embedded URL path is empty

[Nutch-dev] [jira] Resolved: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-436. Resolution: Fixed Patch tested on 10,000 URL run with no apparent issues. Reviewed and committed

[Nutch-dev] [jira] Closed: (NUTCH-233) wrong regular expression hang reduce process for ever

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-233. -- Issue closed wrong regular expression hang reduce process for ever

Re: [Nutch-dev] FW: Nutch release process help

2007-03-06 Thread Dennis Kubes
, and for that, I apologize. I will do my best to thoroughly vet all such discussions on the nutch list in the future. No issues with me. Dennis Kubes Cheers, Chris -- Forwarded Message From: Chris Mattmann [EMAIL PROTECTED] Date: Mon, 05 Mar 2007 21:25:30 -0800 To: Piotr

Re: [Nutch-dev] java.io.FileNotFoundException: / (Is a directory)

2007-03-05 Thread Dennis Kubes
That is a problem with the hadoop.log.dir value not being set. It is trying to use the DRFA appender to a file and can't find the log directory. Dennis Gal Nitzan wrote: Just installed latest from trunk. I run mergesegs and I get the following error in all tasks log files (I use default

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-05 Thread Dennis Kubes
Chris Mattmann wrote: Hi Guys, Blocker * NUTCH-400 (Update add missing license headers) - I believe this is fixed and should be closed +1, thanks to Sami for closing it. * NUTCH-353 (pages that serverside forwards will be refetched every time) - this was partially fixed

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-04 Thread Dennis Kubes
NUTCH-436 has a patch now if we want to add that to this release. Dennis Kubes Andrzej Bialecki wrote: Sean Dean wrote: As for which Hadoop version is included in the next Nutch release, I share the same concern as Sami with 0.10.1 as it NPE's on anything above 100-200k URLs. I can

Re: [Nutch-dev] Welcome Dennis Kubes as Nutch committer

2007-03-04 Thread Dennis Kubes
OK. I finally figured out how to republish the site. Only took me 3 days. Feeling hazed now! :) Dennis Kubes Sami Siren wrote: Welcome on board Dennis! -- Sami Siren Dennis Kubes wrote: Hi All, Thank you Andrzej for your kind words. I am looking forward to working together

Re: [Nutch-dev] Issues pending before 0.9 release

2007-03-03 Thread Dennis Kubes
. Dennis Kubes -- Best regards, Andrzej Bialecki, Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com

Re: [Nutch-dev] log guards

2007-02-28 Thread Dennis Kubes
I can also work on this, Chris do you want me to do it or do you want to coordinate our efforts? Dennis Kubes Jérôme Charron wrote: Hi Chris, The JIRA issue is the 309 : https://issues.apache.org/jira/browse/NUTCH-309 Thanks for your help. Jérôme On 2/13/07, Chris Mattmann [EMAIL

[Nutch-dev] [jira] Commented: (NUTCH-447) Dmoz Structure Parser Tool

2007-02-21 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474713 ] Dennis Kubes commented on NUTCH-447: This tool is for people who need a defined category structure or want

[Nutch-dev] [jira] Created: (NUTCH-448) Allow Plugin Includes and Excludes from File

2007-02-21 Thread Dennis Kubes (JIRA)
Environment: all platforms Reporter: Dennis Kubes Assigned To: Dennis Kubes Priority: Minor Fix For: 0.9.0 This functionality allows the plugin.includes and plugin.excludes values to be moved out of the nutch-default.xml and nutch-site.xml files and loaded

[Nutch-dev] [jira] Updated: (NUTCH-448) Allow Plugin Includes and Excludes from File

2007-02-21 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-448: --- Attachment: plugin-fromfile.patch The plugin-fromfile.patch file contains the functionality

[Nutch-dev] [jira] Created: (NUTCH-447) Dmoz Structure Parser Tool

2007-02-20 Thread Dennis Kubes (JIRA)
Reporter: Dennis Kubes Assigned To: Dennis Kubes Priority: Minor This is a tool that will take the dmoz structure RDF file and return a listing of the categories. The categories returned can be limited by depth or by regular expression pattern. This tool borrows heavily from
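A rough sketch of the kind of filtering described above, limiting a category listing by depth or by regular expression. This is not the DmozStructureParser from the patch. The class name, the slash-separated path format, and the definition of depth as the number of path segments are all assumptions made for illustration, and RDF parsing is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not the DmozStructureParser from NUTCH-447.
final class CategoryFilterSketch {

  // Keeps categories no deeper than maxDepth (0 or less means no limit)
  // and, if a pattern is given, only those containing a match for it.
  static List<String> filter(List<String> categories, int maxDepth, Pattern pattern) {
    List<String> kept = new ArrayList<>();
    for (String category : categories) {
      // Assumed: depth is the count of slash-separated segments.
      int depth = category.split("/").length;
      if (maxDepth > 0 && depth > maxDepth) {
        continue;
      }
      if (pattern != null && !pattern.matcher(category).find()) {
        continue;
      }
      kept.add(category);
    }
    return kept;
  }
}
```

For example, filtering Top, Top/Arts, Top/Arts/Music and Top/Science with a maximum depth of 2 and no pattern keeps everything except Top/Arts/Music.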

[Nutch-dev] [jira] Updated: (NUTCH-447) Dmoz Structure Parser Tool

2007-02-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-447: --- Attachment: dmoz-structure.patch Patch that contains the DmozStructureParser class. Dmoz Structure

[Nutch-dev] [jira] Updated: (NUTCH-247) robot parser to restrict.

2007-02-19 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-247: --- Attachment: agent-names3.patch.txt This patch logs and throws an exception if the agent name

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-19 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474355 ] Dennis Kubes commented on NUTCH-247: We could move the code to a utility class but if we want it to be called

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-18 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474068 ] Dennis Kubes commented on NUTCH-247: I agree, but then should we approach the check as a configurable option

[Nutch-dev] [jira] Assigned: (NUTCH-247) robot parser to restrict.

2007-02-17 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes reassigned NUTCH-247: -- Assignee: Dennis Kubes robot parser to restrict

[Nutch-dev] [jira] Updated: (NUTCH-247) robot parser to restrict.

2007-02-17 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-247: --- Attachment: agent-names.patch This patch removes the checks and severe logging from

Re: [Nutch-dev] Injector checking for other than STATUS_INJECTED

2007-02-15 Thread Dennis Kubes
in the Reducer. Dennis Kubes Andrzej Bialecki wrote: Gal Nitzan wrote: Hi Andrzej, Does it mean that when you inject an existing (in crawldb) a URL it changes its status to STATUS_DB_UNFETCHED? With the current version of Injector - it won't. With previous versions - it might

Re: [Nutch-dev] Injector checking for other than STATUS_INJECTED

2007-02-15 Thread Dennis Kubes
Ahhh... Now I get it :) Andrzej Bialecki wrote: Dennis Kubes wrote: Sorry. I am still not getting this. I understand the reason but I am not seeing how it works. Ah, because apparently it doesn't ... :( You were right, the first job consists only of new records. Now that I checked

Re: [Nutch-dev] NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

2007-02-14 Thread Dennis Kubes
pdfbox software to parse PDF files so you may want to take the specific file and see if it parses correctly outside of nutch using pdfbox. Dennis Kubes Armel T. Nene wrote: Dennis I was wondering if this patch could fix my problem which is, if not the same, very similar to this one. I am

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-14 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473295 ] Dennis Kubes commented on NUTCH-247: I think the idea here is to NOT allow people to run fetchers for which

[Nutch-dev] [jira] Updated: (NUTCH-437) MapFile in Hadoop Trunk has changed, must update references

2007-02-13 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-437: --- Description: The MapFile.Writer signature has changed in hadoop trunk (version 10.x +) to include

Re: [Nutch-dev] NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

2007-02-13 Thread Dennis Kubes
This has to do with HADOOP-964. Replace the jar files in your Nutch versions with the most recent versions from Hadoop. You will also need to apply NUTCH-437 patch to get Nutch to work with the most recent changes to the Hadoop codebase. Dennis Kubes Gal Nitzan wrote: Hi, Does anybody

Re: [Nutch-dev] NPE in org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue

2007-02-13 Thread Dennis Kubes
Actually I take it back. I don't think it is the same problem but I do think it is the right solution. Dennis Kubes Dennis Kubes wrote: This has to do with HADOOP-964. Replace the jar files in your Nutch versions with the most recent versions from Hadoop. You will also need to apply

Re: [Nutch-dev] JobConf Questions

2007-02-06 Thread Dennis Kubes
If no mapper or reducer class is set in the jobConf then the code defaults to IdentityMapper and IdentityReducer respectively which essentially are pass throughs of key/value pairs. Dennis Kubes Charlie Williams wrote: I am very new to the Nutch source code, and have been reading over
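The pass-through behaviour described above can be modelled with a toy, single-process job runner. The sketch does not use Hadoop's classes. The interfaces MapFn, ReduceFn and Emitter and the run method are made up for illustration, keys and values are plain strings, and grouping is done with a sorted map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model only; these are not Hadoop's Mapper/Reducer interfaces.
final class IdentityJobSketch {
  interface Emitter { void emit(String key, String value); }
  interface MapFn { void map(String key, String value, Emitter out); }
  interface ReduceFn { void reduce(String key, List<String> values, Emitter out); }

  // Identity map: emit each input pair unchanged.
  static final MapFn IDENTITY_MAP = (k, v, out) -> out.emit(k, v);

  // Identity reduce: emit every value of a key unchanged.
  static final ReduceFn IDENTITY_REDUCE = (k, vs, out) -> {
    for (String v : vs) {
      out.emit(k, v);
    }
  };

  // Runs map, group-by-key and reduce; null functions fall back to identity.
  static List<Map.Entry<String, String>> run(
      List<Map.Entry<String, String>> input, MapFn mapper, ReduceFn reducer) {
    MapFn m = mapper != null ? mapper : IDENTITY_MAP;
    ReduceFn r = reducer != null ? reducer : IDENTITY_REDUCE;
    TreeMap<String, List<String>> grouped = new TreeMap<>();
    for (Map.Entry<String, String> e : input) {
      m.map(e.getKey(), e.getValue(),
          (k, v) -> grouped.computeIfAbsent(k, x -> new ArrayList<>()).add(v));
    }
    List<Map.Entry<String, String>> output = new ArrayList<>();
    for (Map.Entry<String, List<String>> g : grouped.entrySet()) {
      r.reduce(g.getKey(), g.getValue(), (k, v) -> output.add(Map.entry(k, v)));
    }
    return output;
  }
}
```

With neither function supplied, run returns the input pairs unchanged apart from being ordered by key, which is the sense in which the defaults are pass-throughs.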

[Nutch-dev] [jira] Created: (NUTCH-437) MapFile in Hadoop 0.10.2 has changed, must update references

2007-02-02 Thread Dennis Kubes (JIRA)
Versions: 0.8.2, 0.9.0 Environment: windows xp and java Reporter: Dennis Kubes Assigned To: Dennis Kubes Fix For: 0.8.2, 0.9.0 The MapFile.Writer signature has changed in hadoop 0.10.2 to include a Configuration object. Object in the Nutch codebase

[Nutch-dev] [jira] Updated: (NUTCH-437) MapFile in Hadoop 0.10.2 has changed, must update references

2007-02-02 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-437: --- Attachment: nutch-hadoop-0.10.2-mapfile.patch This patch changes the references to MapFile.Writer

[Nutch-dev] Cross Platform Administration and Deployment for Nutch and Hadoop

2007-01-23 Thread Dennis Kubes
and we will see if we can integrate the requests into the development. Dennis Kubes

Re: [Nutch-dev] How to Become a Nutch Developer

2007-01-22 Thread Dennis Kubes
Zaheed Haque wrote: On 1/21/07, Andrzej Bialecki [EMAIL PROTECTED] wrote: Well ... so far this process was very informal, because there were so few key developers that they more or less knew what needs to be done, and who is doing what. Hadoop follows a much stricter and formalized model,

Re: [Nutch-dev] How to Become a Nutch Developer

2007-01-22 Thread Dennis Kubes
night so everything should be done in a couple of days. Dennis Kubes Chris Mattmann wrote: Hi Dennis, On 1/21/07 11:47 AM, Dennis Kubes [EMAIL PROTECTED] wrote: All, I am working on a How to Become a Nutch Developer document for the wiki and I need some input. I need an overview

Re: [Nutch-dev] How to Become a Nutch Developer

2007-01-22 Thread Dennis Kubes
Doug Can you answer the question of how to add developer names to JIRA or if that is only for committers? Dennis Doug Cutting wrote: Andrzej Bialecki wrote: The workflow is different - I'm not sure about the details, perhaps Doug can correct me if I'm wrong ... and yes, it uses JIRA

[Nutch-dev] How to Become a Nutch Developer

2007-01-21 Thread Dennis Kubes
in the JIRA system or with the mailing lists, committers, etc? Getting this information together in one place will go a long way toward helping others to start contributing more and more. Thanks for all your input. Dennis Kubes

Re: [Nutch-dev] Next Nutch release

2007-01-20 Thread Dennis Kubes
Andrzej Bialecki wrote: Dennis Kubes wrote: I completely agree with this. I am interested in devoting as much time as possible to seeing the success of Nutch, Hadoop, and Lucene. As our business grows I would also be willing to devote developers full time to work on Nutch, Hadoop

Re: [Nutch-dev] Next Nutch release

2007-01-19 Thread Dennis Kubes
to know to help. At this time I don't think it is a design problem I think it is a people problem. I will be more than willing to head up training, documenting, and helping developers get up to speed. I just need direction in this area myself. Dennis Kubes

Re: [Nutch-dev] How can I get one plugin's root dir

2007-01-17 Thread Dennis Kubes
. Dennis Kubes Scott Green wrote: Thanks Andrzej and Doug! I will try both in my later work and evaluate them. On 1/17/07, Doug Cutting [EMAIL PROTECTED] wrote: Andrzej Bialecki wrote: The reason is that if you pack this file into your job JAR, the job jar would become very large (presumably

Re: [Nutch-dev] How can I get one plugin's root dir

2007-01-15 Thread Dennis Kubes
(conf); PluginDescriptor desc = rep.getPluginDescriptor("parse-html"); String path = desc.getPluginPath(); System.out.println(path); Dennis Kubes Scott Green wrote: Can someone give an answer? I don't think it is a good idea to put all configuration/resources under the conf dir. On 1/15/07

Re: [Nutch-dev] sort result on different set of terms

2007-01-12 Thread Dennis Kubes
to get corr category score and use that for sorting. Any thoughts? You could populate the sort field dynamically but still only a single field. Are you trying to sort on multiple category fields? Dennis Kubes Thanks, On 1/11/07, Dennis Kubes [EMAIL PROTECTED] wrote: You

Re: [Nutch-dev] sort result on different set of terms

2007-01-11 Thread Dennis Kubes
You can write a scoring filter. That is much easier than changing NutchSimilarity. Take a look at the scoring-opic plugin under src. That will demonstrate the default scoring algorithm. Dennis Kubes DS jha wrote: Hello - I would like to score summarize results on a different set

Re: [Nutch-dev] Nutch Indexing

2006-10-27 Thread Dennis Kubes
of the scoring algorithm. Dennis Kubes Otis Gospodnetic wrote: Stephane, Nutch uses Lucene for indexing, and Lucene has a class called IndexWriter that is used for indexing Lucene Documents. Here is a quick grep in Nutch's *java files: $ ffjg -l IndexWriter ./src/test/org/apache/nutch

Re: [Nutch-dev] nutch/lucene question...

2006-08-25 Thread Dennis Kubes
bruce wrote: hi... if it's ok, i've got some basic research questions. can someone tell me if there's a limit to the number of simultaneous websites that nutch/lucene can return...? I assume you are asking about its indexing capacity. If that is the case, it is billions; it is pretty much

[Nutch-dev] Single Search Server, Multiple Indexes on Separate Disks

2006-08-24 Thread Dennis Kubes
. I am willing to put it into practice and test. Dennis Kubes

Re: [Nutch-dev] No space left on device

2006-06-14 Thread Dennis Kubes
The tasktracker requires intermediate space while performing the map and reduce functions. Many smaller files are produced during the map and reduce processes and are deleted when the processes finish. If you are using the DFS, then more disk space is required than is actually used since disk

Re: [Nutch-dev] how to manipulate with MapWritable metaData in CrawlDatum structure

2006-06-11 Thread Dennis Kubes
The MapWritable acts as a shared memory area, or Map, that you can put other Writables into and retrieve them from. To add metadata to the CrawlDatum you would use something like this: datum.getMetaData().put(key, value) where key and value are both Writable implementations such as UTF8 or
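
A self-contained sketch of the put-and-retrieve pattern from this reply follows. It does not depend on Nutch or Hadoop: the local `Writable` interface only mirrors the shape of Hadoop's interface of the same name, and `StringWritable` and `Datum` are hypothetical stand-ins for the UTF8 and CrawlDatum classes mentioned in the message.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class MetaDataSketch {

    // Local stand-in mirroring the shape of Hadoop's Writable interface,
    // declared here only so the sketch compiles without any library.
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // Hypothetical string-valued Writable, playing the role of UTF8.
    static final class StringWritable implements Writable {
        private String value = "";

        StringWritable() {}
        StringWritable(String value) { this.value = value; }

        @Override public void write(DataOutput out) throws IOException { out.writeUTF(value); }
        @Override public void readFields(DataInput in) throws IOException { value = in.readUTF(); }

        // equals/hashCode let an equal key find the stored entry again.
        @Override public boolean equals(Object o) {
            return o instanceof StringWritable && ((StringWritable) o).value.equals(value);
        }
        @Override public int hashCode() { return value.hashCode(); }
        @Override public String toString() { return value; }
    }

    // Hypothetical datum exposing a metadata map, analogous to datum.getMetaData().
    static final class Datum {
        private final Map<Writable, Writable> metaData = new HashMap<>();
        Map<Writable, Writable> getMetaData() { return metaData; }
    }

    public static void main(String[] args) {
        Datum datum = new Datum();
        // Store a key/value pair of Writables in the metadata map.
        datum.getMetaData().put(new StringWritable("category"), new StringWritable("news"));
        // Retrieve it later with an equal key.
        System.out.println(datum.getMetaData().get(new StringWritable("category")));
    }
}
```

Keys need consistent `equals` and `hashCode` implementations, otherwise a lookup with a freshly constructed key would not find the stored value.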
