[Nutch-dev] [jira] Commented: (NUTCH-471) Fix synchronization in NutchBean creation

2007-07-14 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12512712 ] Dennis Kubes commented on NUTCH-471: Ah, sorry, my configuration was the problem. If you don't upgrade the

[Nutch-dev] [jira] Closed: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-25 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-497. Issue resolved and committed. Extreme Nested Tags causes StackOverflowException in

[Nutch-dev] [jira] Resolved: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-25 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-497. Resolution: Fixed. Committed with revision 550669. Extreme Nested Tags causes StackOverflowException

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: Attachment: (was: nested-tags-trap2.patch) Extreme Nested Tags causes StackOverflowException in

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: Attachment: (was: nested-tags-trap3.patch) Extreme Nested Tags causes StackOverflowException in

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: Attachment: nested-tags-trap2.patch Added nested-tags-trap2.patch with Apache grant. Extreme Nested

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-24 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: Attachment: nested-tags-trap3.patch Added nested-tags-trap3.patch with Apache grant. Extreme Nested

[Nutch-dev] [jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-21 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506894 ] Dennis Kubes commented on NUTCH-497: I agree, I think it would be better to have something generic if we are

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: Attachment: nested-tags-trap.patch This patch reworks DomContentUtils.getOutlinks to use a stack
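
The idea named in this entry, replacing recursion with an explicit stack so that extreme nesting cannot overflow the call stack, can be illustrated with a minimal sketch. This is not the patch itself and not Nutch code: the `Node` class, `collect_links`, and `deeply_nested` are hypothetical names invented for illustration, and Python is used only to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DOM node; not a Nutch class."""
    tag: str
    href: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def collect_links(root: Node) -> List[str]:
    """Collect href values in document order without recursion."""
    links: List[str] = []
    stack = [root]  # explicit stack takes the place of the call stack
    while stack:
        node = stack.pop()
        if node.tag == "a" and node.href is not None:
            links.append(node.href)
        # Push children in reverse so the leftmost child is popped first.
        stack.extend(reversed(node.children))
    return links


def deeply_nested(depth: int, href: str) -> Node:
    """Build a chain of nested <div> nodes with a single link at the bottom."""
    root = Node("div")
    current = root
    for _ in range(depth):
        child = Node("div")
        current.children.append(child)
        current = child
    current.children.append(Node("a", href=href))
    return root
```

Because the pending nodes live in an ordinary list on the heap, the traversal depth is bounded by available memory rather than by the interpreter's or JVM's call-stack limit, which is the property the spider-trap pages were exploiting.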

[Nutch-dev] [jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506596 ] Dennis Kubes commented on NUTCH-497: The newest patch is the nested-tags-trap.patch file. Extreme Nested Tags

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: Attachment: nested-tags-trap2.patch Patch with the curNodeDepth removed. The patch file is

[Nutch-dev] [jira] Commented: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12506725 ] Dennis Kubes commented on NUTCH-497: Doğacan, that is correct. By using the stack we shouldn't get a

[Nutch-dev] [jira] Updated: (NUTCH-497) Extreme Nested Tags causes StackOverflowException in DomContentUtils...Spider Trap

2007-06-06 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-497: Attachment: ExtremeNestedTags.patch This is a rudimentary fix for those that want a workaround for

[Nutch-dev] [jira] Created: (NUTCH-467) DeleteDuplicate fails if Segment index directory has 0 documents

2007-04-04 Thread Dennis Kubes (JIRA)
DeleteDuplicate fails if Segment index directory has 0 documents Key: NUTCH-467 URL: https://issues.apache.org/jira/browse/NUTCH-467 Project: Nutch Issue Type: Bug

[Nutch-dev] [jira] Updated: (NUTCH-467) DeleteDuplicate fails if Segment index directory has 0 documents

2007-04-04 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-467: Attachment: nutch-467.patch Submitted by Andrzej Bialecki. DeleteDuplicate fails if Segment index

[Nutch-dev] [jira] Updated: (NUTCH-333) SegmentMerger and SegmentReader should use NutchJob

2007-04-02 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-333: Attachment: use-nutch-job_patch.txt Updated patch, submitted by Doğacan Güney. SegmentMerger and

[Nutch-dev] [jira] Resolved: (NUTCH-333) SegmentMerger and SegmentReader should use NutchJob

2007-04-02 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-333. Resolution: Fixed. Issue resolved. SegmentMerger and SegmentReader should use NutchJob

[Nutch-dev] [jira] Closed: (NUTCH-333) SegmentMerger and SegmentReader should use NutchJob

2007-04-02 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-333. SegmentMerger and SegmentReader should use NutchJob

[Nutch-dev] [jira] Created: (NUTCH-459) Upgrade Nutch to Hadoop 0.12.1

2007-03-15 Thread Dennis Kubes (JIRA)
Upgrade Nutch to Hadoop 0.12.1 Key: NUTCH-459 URL: https://issues.apache.org/jira/browse/NUTCH-459 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0 Environment: All platforms

[Nutch-dev] [jira] Resolved: (NUTCH-233) wrong regular expression hang reduce process for ever

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-233. Resolution: Fixed. The new regex has been added to both the regex-urlfilter.txt and the

[Nutch-dev] [jira] Closed: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-436. Issue closed. Incorrect handling of relative paths when the embedded URL path is empty

[Nutch-dev] [jira] Resolved: (NUTCH-436) Incorrect handling of relative paths when the embedded URL path is empty

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-436?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes resolved NUTCH-436. Resolution: Fixed. Patch tested on a 10,000-URL run with no apparent issues. Reviewed and committed.

[Nutch-dev] [jira] Closed: (NUTCH-233) wrong regular expression hang reduce process for ever

2007-03-09 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes closed NUTCH-233. Issue closed. wrong regular expression hang reduce process for ever

[Nutch-dev] [jira] Commented: (NUTCH-447) Dmoz Structure Parser Tool

2007-02-21 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474713 ] Dennis Kubes commented on NUTCH-447: This tool is for people who need a defined category structure or want to

[Nutch-dev] [jira] Created: (NUTCH-448) Allow Plugin Includes and Excludes from File

2007-02-21 Thread Dennis Kubes (JIRA)
Allow Plugin Includes and Excludes from File Key: NUTCH-448 URL: https://issues.apache.org/jira/browse/NUTCH-448 Project: Nutch Issue Type: Improvement Affects Versions: 0.9.0

[Nutch-dev] [jira] Updated: (NUTCH-448) Allow Plugin Includes and Excludes from File

2007-02-21 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-448?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-448: --- Attachment: plugin-fromfile.patch The plugin-fromfile.patch file contains the functionality for

[Nutch-dev] [jira] Created: (NUTCH-447) Dmoz Structure Parser Tool

2007-02-20 Thread Dennis Kubes (JIRA)
Dmoz Structure Parser Tool Key: NUTCH-447 URL: https://issues.apache.org/jira/browse/NUTCH-447 Project: Nutch Issue Type: New Feature Affects Versions: 0.9.0 Environment: all platforms

[Nutch-dev] [jira] Updated: (NUTCH-447) Dmoz Structure Parser Tool

2007-02-20 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-447: Attachment: dmoz-structure.patch Patch that contains the DmozStructureParser class. Dmoz Structure

[Nutch-dev] [jira] Updated: (NUTCH-247) robot parser to restrict.

2007-02-19 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-247: Attachment: agent-names3.patch.txt This patch logs and throws an exception if the agent name is not

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-19 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474355 ] Dennis Kubes commented on NUTCH-247: We could move the code to a utility class but if we want it to be called

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-18 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12474068 ] Dennis Kubes commented on NUTCH-247: I agree, but then should we approach the check as a configurable option?

[Nutch-dev] [jira] Assigned: (NUTCH-247) robot parser to restrict.

2007-02-17 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes reassigned NUTCH-247: Assignee: Dennis Kubes robot parser to restrict.

[Nutch-dev] [jira] Updated: (NUTCH-247) robot parser to restrict.

2007-02-17 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-247: Attachment: agent-names.patch This patch removes the checks and severe logging from the

[Nutch-dev] [jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-14 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473295 ] Dennis Kubes commented on NUTCH-247: I think the idea here is to NOT allow people to run fetchers for which they

[Nutch-dev] [jira] Updated: (NUTCH-437) MapFile in Hadoop Trunk has changed, must update references

2007-02-13 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-437: Description: The MapFile.Writer signature has changed in Hadoop trunk (version 10.x +) to include a

[Nutch-dev] [jira] Created: (NUTCH-437) MapFile in Hadoop 0.10.2 has changed, must update references

2007-02-02 Thread Dennis Kubes (JIRA)
MapFile in Hadoop 0.10.2 has changed, must update references Key: NUTCH-437 URL: https://issues.apache.org/jira/browse/NUTCH-437 Project: Nutch Issue Type: Bug Affects

[Nutch-dev] [jira] Updated: (NUTCH-437) MapFile in Hadoop 0.10.2 has changed, must update references

2007-02-02 Thread Dennis Kubes (JIRA)
[ https://issues.apache.org/jira/browse/NUTCH-437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dennis Kubes updated NUTCH-437: Attachment: nutch-hadoop-0.10.2-mapfile.patch This patch changes the references to MapFile.Writer

[Nutch-dev] [jira] Created: (NUTCH-295) More description for fetcher.threads.fetch property

2006-06-02 Thread Dennis Kubes (JIRA)
More description for fetcher.threads.fetch property Key: NUTCH-295 URL: http://issues.apache.org/jira/browse/NUTCH-295 Project: Nutch Type: Improvement Components: fetcher Versions: 0.8-dev

[Nutch-dev] [jira] Updated: (NUTCH-295) More description for fetcher.threads.fetch property

2006-06-02 Thread Dennis Kubes (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-295?page=all ] Dennis Kubes updated NUTCH-295: Attachment: fetcher_threads_desc.patch More description for fetcher.threads.fetch property as relating to running in distributed mode. More description for

[Nutch-dev] [jira] Created: (NUTCH-255) Regular Expression for RegexUrlNormalizer to remove jsessionid

2006-04-25 Thread Dennis Kubes (JIRA)
Regular Expression for RegexUrlNormalizer to remove jsessionid Key: NUTCH-255 URL: http://issues.apache.org/jira/browse/NUTCH-255 Project: Nutch Type: Improvement Components: fetcher Versions:

[Nutch-dev] [jira] Updated: (NUTCH-254) Fetcher throws NullPointer if redirect URL is filtered

2006-04-24 Thread Dennis Kubes (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-254?page=all ] Dennis Kubes updated NUTCH-254: Attachment: fetcher_filter_url_patch.txt Patch to fix null pointer in fetcher for filtered URLs. Fetcher throws NullPointer if redirect URL is filtered

[Nutch-dev] [jira] Created: (NUTCH-243) Some meta-refresh urls get ignored due to matching regular expression

2006-04-04 Thread Dennis Kubes (JIRA)
Some meta-refresh urls get ignored due to matching regular expression Key: NUTCH-243 URL: http://issues.apache.org/jira/browse/NUTCH-243 Project: Nutch Type: Bug Components: fetcher