[jira] Commented: (NUTCH-65) index-more plugin can't parse large set of modification-date

2005-07-04 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-65?page=comments#action_12315010 ] Lutischán Ferenc commented on NUTCH-65: --- Dear Developers, I have a final solution (I am behind a firewall, so I can't make a patch with svn). I suggest you please commit

[jira] Created: (NUTCH-123) Cache.jsp some times generate NullPointerException

2005-11-04 Thread JIRA
Cache.jsp some times generate NullPointerException -- Key: NUTCH-123 URL: http://issues.apache.org/jira/browse/NUTCH-123 Project: Nutch Type: Bug Components: web gui Environment: All systems Reporter

[jira] Commented: (NUTCH-133) ParserFactory does not work as expected

2005-12-07 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-133?page=comments#action_12359564 ] Lutischán Ferenc commented on NUTCH-133: Dear Stephan, Please see http://issues.apache.org/jira/browse/NUTCH-123. This problem is also present in cached.jsp. Regards

[jira] Created: (NUTCH-174) Problem encountered with ant during compilation

2006-01-14 Thread JIRA
Problem encountered with ant during compilation --- Key: NUTCH-174 URL: http://issues.apache.org/jira/browse/NUTCH-174 Project: Nutch Type: Bug Versions: 0.7.1 Environment: SUSE Linux 9.3 Reporter: Matthias

[jira] Created: (NUTCH-176) Using -dir: creates an error, when the directory already exists

2006-01-15 Thread JIRA
Using -dir: creates an error, when the directory already exists --- Key: NUTCH-176 URL: http://issues.apache.org/jira/browse/NUTCH-176 Project: Nutch Type: Bug Versions: 0.7.1 Environment: SUSE

[jira] Created: (NUTCH-177) Default installation seems to produce working entity of nutch

2006-01-15 Thread JIRA
Default installation seems to produce working entity of nutch - Key: NUTCH-177 URL: http://issues.apache.org/jira/browse/NUTCH-177 Project: Nutch Type: Bug Versions: 0.7.1 Environment: Linux SUSE

[jira] Updated: (NUTCH-177) Default installation seems to produce working entity of nutch

2006-01-15 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-177?page=all ] Matthias Günter updated NUTCH-177: -- Attachment: crawl-urlfilter.txt The crawl-filter with a change for apache.org Default installation seems to produce working entity of nutch

[jira] Updated: (NUTCH-177) Default installation seems to produce working entity of nutch

2006-01-15 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-177?page=all ] Matthias Günter updated NUTCH-177: -- Attachment: urllist.txt URL-List used.. Default installation seems to produce working entity of nutch

[jira] Created: (NUTCH-208) http: proxy exception list:

2006-02-08 Thread JIRA
http: proxy exception list: Key: NUTCH-208 URL: http://issues.apache.org/jira/browse/NUTCH-208 Project: Nutch Type: New Feature Components: fetcher Versions: 0.8-dev Reporter: Matthias Günter Priority: Minor I

[jira] Updated: (NUTCH-208) http: proxy exception list:

2006-02-08 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-208?page=all ] Matthias Günter updated NUTCH-208: -- Attachment: patch.txt A preliminary patch!! http: proxy exception list: --- Key: NUTCH-208 URL: http

[jira] Commented: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-09-08 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12433354 ] Doğacan Güney commented on NUTCH-339: - I have made a few changes to Andrzej's latest patch. The biggest change is that BLOCKED_ADDR_QUEUE is now a priority

[jira] Updated: (NUTCH-339) Refactor nutch to allow fetcher improvements

2006-09-08 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-339?page=all ] Doğacan Güney updated NUTCH-339: Attachment: patch3.txt Refactor nutch to allow fetcher improvements Key: NUTCH-339

[jira] Created: (NUTCH-397) porting clustering-carrot2 plugin to carrot2 v2.0

2006-11-07 Thread JIRA
porting clustering-carrot2 plugin to carrot2 v2.0 - Key: NUTCH-397 URL: http://issues.apache.org/jira/browse/NUTCH-397 Project: Nutch Issue Type: Improvement Reporter: Doğacan

[jira] Created: (NUTCH-406) Metadata tries to write null values

2006-11-23 Thread JIRA
Metadata tries to write null values --- Key: NUTCH-406 URL: http://issues.apache.org/jira/browse/NUTCH-406 Project: Nutch Issue Type: Bug Affects Versions: 0.9.0 Reporter: Doğacan Güney

[jira] Updated: (NUTCH-406) Metadata tries to write null values

2006-11-23 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ] Doğacan Güney updated NUTCH-406: Attachment: NUTCH-406.patch A simple patch that writes nulls as empty strings. Metadata tries to write null values
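
The idea in the snippet above (write nulls as empty strings) can be sketched roughly as follows. This is only an illustration in Python, not the attached NUTCH-406.patch and not Nutch's Metadata class: the length-prefixed byte layout and the names write_metadata and read_metadata are assumptions made up for this sketch.

```python
import io
import struct


def _write_string(out, text):
    # Hypothetical layout: 4-byte big-endian length, then UTF-8 bytes.
    data = text.encode("utf-8")
    out.write(struct.pack(">i", len(data)))
    out.write(data)


def _read_string(inp):
    (length,) = struct.unpack(">i", inp.read(4))
    return inp.read(length).decode("utf-8")


def write_metadata(out, metadata):
    """Serialize a dict of name -> list of values; None is written as ""."""
    out.write(struct.pack(">i", len(metadata)))
    for name, values in metadata.items():
        _write_string(out, name)
        out.write(struct.pack(">i", len(values)))
        for value in values:
            # The point of the sketch: never hand None to the string writer.
            _write_string(out, "" if value is None else value)


def read_metadata(inp):
    """Read back what write_metadata produced."""
    (count,) = struct.unpack(">i", inp.read(4))
    metadata = {}
    for _ in range(count):
        name = _read_string(inp)
        (n,) = struct.unpack(">i", inp.read(4))
        metadata[name] = [_read_string(inp) for _ in range(n)]
    return metadata
```

One consequence of this choice is that a reader can no longer tell a missing value from an empty one, which may be why the follow-up message proposes a different approach.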

[jira] Updated: (NUTCH-406) Metadata tries to write null values

2006-11-23 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-406?page=all ] Doğacan Güney updated NUTCH-406: Attachment: NUTCH-406.patch How about something like this then? Metadata tries to write null values --- Key

[jira] Commented: (NUTCH-92) DistributedSearch incorrectly scores results

2006-11-27 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-92?page=comments#action_12453682 ] Dogacan Güney commented on NUTCH-92: Here is my second attempt at this. Now DistributedSearch$Client keeps a mapping from addresses to numDocs, and in search

[jira] Updated: (NUTCH-92) DistributedSearch incorrectly scores results

2006-11-27 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-92?page=all ] Dogacan Güney updated NUTCH-92: --- Attachment: distributed-idf-v2.patch DistributedSearch incorrectly scores results Key: NUTCH-92

[jira] Updated: (NUTCH-411) Parse ignores meta refresh redirection

2006-11-30 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-411?page=all ] Dogacan Güney updated NUTCH-411: Attachment: parse-redirect.patch Parse ignores meta refresh redirection -- Key: NUTCH-411

[jira] Commented: (NUTCH-413) Fetcher ignores -noParsing command line option

2006-12-08 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-413?page=comments#action_12456832 ] Dogacan Güney commented on NUTCH-413: - Are you sure about this? Running the fetcher (latest trunk) with -noParsing option does not create any parse segments

[jira] Commented: (NUTCH-413) Fetcher ignores -noParsing command line option

2006-12-08 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-413?page=comments#action_12456967 ] Dogacan Güney commented on NUTCH-413: - About command-line options: that is not what I meant (I am not a native speaker). I meant that I also set fetcher.parse

[jira] Commented: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.

2006-12-15 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-417?page=comments#action_12458794 ] Dogacan Güney commented on NUTCH-417: - Patch for indexer. Instead of using the FileSystem coming from getRecordWriter, use FileSystem.get(job) to get the file

[jira] Updated: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.

2006-12-15 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-417?page=all ] Dogacan Güney updated NUTCH-417: Attachment: index.patch After upgrade to hadoop-0.9.1, parsing and indexing doesn't work

[jira] Commented: (NUTCH-417) After upgrade to hadoop-0.9.1, parsing and indexing doesn't work.

2006-12-15 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-417?page=comments#action_12458811 ] Dogacan Güney commented on NUTCH-417: - Setting speculative execution to false also fixes my problem with parser. Thank you for the quick answer. I guess you

[jira] Created: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2006-12-26 Thread JIRA
DeleteDuplicates.HashPartitioner depends on the order of IndexDocs -- Key: NUTCH-420 URL: http://issues.apache.org/jira/browse/NUTCH-420 Project: Nutch Issue Type: Bug

[jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2006-12-26 Thread JIRA
[ http://issues.apache.org/jira/browse/NUTCH-420?page=all ] Dogacan Güney updated NUTCH-420: Attachment: dedup.patch Patch for the problem. This patch also slightly refactors the code. DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

[jira] Commented: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-04 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12462173 ] Dogacan Güney commented on NUTCH-420: - I realized that my last patch if's some irrelevant LOG.debug code

[jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-04 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-420: Attachment: dedup-v2.patch DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

[jira] Commented: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-08 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463056 ] Dogacan Güney commented on NUTCH-420: - I thought I would attach an index which exhibits this bug. If you run

[jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-08 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-420: Attachment: index.tar.gz DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

[jira] Commented: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12463214 ] Dogacan Güney commented on NUTCH-420: - Attaching the patch with a testcase (I hope that I got it right, but I am

[jira] Updated: (NUTCH-420) DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

2007-01-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-420?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-420: Attachment: dedup-v3.patch DeleteDuplicates.HashPartitioner depends on the order of IndexDocs

[jira] Created: (NUTCH-438) Add -noAdditions to updatedb

2007-02-02 Thread JIRA
Add -noAdditions to updatedb Key: NUTCH-438 URL: https://issues.apache.org/jira/browse/NUTCH-438 Project: Nutch Issue Type: Improvement Affects Versions: 0.8.1, 0.8 Reporter: Nicolás Lichtmaier

[jira] Updated: (NUTCH-438) Add -noAdditions to updatedb

2007-02-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolás Lichtmaier updated NUTCH-438: - Attachment: noAdditions-backport.diff I've backported revision 450799 to the 0.8.x branch

[jira] Created: (NUTCH-440) Command line utilities should exit with an error message when given wrong arguments

2007-02-06 Thread JIRA
Command line utilities should exit with an error message when given wrong arguments --- Key: NUTCH-440 URL: https://issues.apache.org/jira/browse/NUTCH-440 Project

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-08 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: parse-map-core-untested.patch allow parsers to return multiple Parse object

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-08 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471260 ] Dogacan Güney commented on NUTCH-443: - Ok, this is the second attempt (sorry that I am sending patches

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-08 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: parse-map-core-draft-v1.patch allow parsers to return multiple Parse object

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471620 ] Dogacan Güney commented on NUTCH-443: - This is pretty much the merge of our work (except parse-rss, it kept

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v1.patch allow parsers to return multiple Parse object

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v2.patch Small update to the patch. Now all core junit tests pass. Now

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v3.patch new patch, contains a possible fix for CrawlDbReducer problem

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12471857 ] Dogacan Güney commented on NUTCH-443: - nutch.newbie: I fail to see what the problem is. If feedparser doesn't

[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-10 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-444: Attachment: parse-feed.tar.bz2 OK, here is my feedparsing plugin using rome. Note that this plugin

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-11 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v5.patch New version. Now indexing also works but has a catch. Many

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-11 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v6.patch Oops... I forgot to merge Renaud Richardet's work

[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-11 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dogacan Güney updated NUTCH-444: Attachment: parse-feed-v2.tar.bz2 Updated parse-feed plugin. Still not ready for any serious use

[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-02-13 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12472581 ] Doğacan Güney commented on NUTCH-444: - Hi nutch.newbie, Can you mail me a list of the failing atom urls

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473129 ] Doğacan Güney commented on NUTCH-443: - Andrzej: Thanks for taking the time to review this. The contract

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473184 ] Doğacan Güney commented on NUTCH-443: - Andrzej: Why does fetcher need to synchronize? Why does the order fetcher

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-14 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-443: Attachment: NUTCH-443-draft-v7.patch allow parsers to return multiple Parse object

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-15 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473383 ] Doğacan Güney commented on NUTCH-443: - Regarding the ObjectWritable: since in this case all data is composed

[jira] Created: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-02-15 Thread JIRA
RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt - Key: NUTCH-446 URL: https://issues.apache.org/jira/browse/NUTCH-446 Project: Nutch
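
The behaviour requested in this issue title can be illustrated with a small sketch. It is written in Python for brevity and is not Nutch's RobotRulesParser or the attached patch; the function name crawl_delay_for and the substring matching of agent names are assumptions made for this example. The point is only that a Crawl-delay line is attributed to the User-agent record it appears in, so a delay set for another bot is not picked up.

```python
def crawl_delay_for(robots_txt, agent_name):
    """Return the Crawl-delay (in seconds) that applies to agent_name, or None."""
    agent = agent_name.lower()
    delays = {}             # user-agent token -> delay for that record
    current = []            # user-agent tokens of the record being read
    reading_agents = False  # True while consecutive User-agent lines are read

    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current = []  # a new record starts here
                reading_agents = True
            if value:
                current.append(value.lower())
        else:
            reading_agents = False
            if field == "crawl-delay":
                try:
                    delay = float(value)
                except ValueError:
                    continue  # ignore malformed values
                for token in current:
                    delays.setdefault(token, delay)

    # A record naming our agent wins over the wildcard record.
    for token, delay in delays.items():
        if token != "*" and token in agent:
            return delay
    return delays.get("*")
```

With a robots.txt that sets a large delay for some other bot and a small one under `*`, a call such as crawl_delay_for(text, "nutch-test") returns only the wildcard value.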

[jira] Updated: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-02-15 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-446: Attachment: crawl-delay.patch RobotRulesParser should ignore Crawl-delay values of other bots

[jira] Commented: (NUTCH-247) robot parser to restrict.

2007-02-16 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12473885 ] Doğacan Güney commented on NUTCH-247: - +1 for this approach. Fetcher should check if agent-name is set

[jira] Updated: (NUTCH-434) Replace usage of ObjectWritable with something based on GenericWritable

2007-02-24 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-434: Attachment: NUTCH-434.patch This patch adds two new classes: GenericWritableConfigurable which

[jira] Commented: (NUTCH-445) Domain İndexing / Query Filter

2007-02-27 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476212 ] Doğacan Güney commented on NUTCH-445: - Has anyone looked at this? Google seems to do site: searches like this too

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-28 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12476611 ] Doğacan Güney commented on NUTCH-443: - * you create the fake CrawlDatum-s in ParseOutputFormat, and then set

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-02-28 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-443: Attachment: NUTCH-443.02282007-v2.patch Yet another patch. ParseResult.filter is out and Nutch

[jira] Updated: (NUTCH-460) RDF parser plugin

2007-03-16 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ricardo J. Méndez updated NUTCH-460: Attachment: rubyspider-rdf.zip Code for the aforementioned plugins, to be included under

[jira] Commented: (NUTCH-460) RDF parser plugin

2007-03-21 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482793 ] Ricardo J. Méndez commented on NUTCH-460: - Two requirements I hadn't added explicitly: Apache Jena: http

[jira] Updated: (NUTCH-438) Add -noAdditions to updatedb

2007-03-27 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Nicolás Lichtmaier updated NUTCH-438: - Description: It would be great for me to have -noAdditions support (which is implemented

[jira] Created: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-04-09 Thread JIRA
Scoring filter should distribute score to all outlinks at once -- Key: NUTCH-468 URL: https://issues.apache.org/jira/browse/NUTCH-468 Project: Nutch Issue Type: Improvement

[jira] Updated: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-04-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-468: Attachment: scoring.patch Patch for the issue. It doesn't change the way scoring-opic works

[jira] Updated: (NUTCH-468) Scoring filter should distribute score to all outlinks at once

2007-04-23 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-468?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-468: Attachment: scoring-v2.patch That makes sense, patch with the suggested change. Scoring filter

[jira] Created: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-04-24 Thread JIRA
Fetcher2 sets server-delay and blocking checks incorrectly -- Key: NUTCH-474 URL: https://issues.apache.org/jira/browse/NUTCH-474 Project: Nutch Issue Type: Bug Components

[jira] Updated: (NUTCH-474) Fetcher2 sets server-delay and blocking checks incorrectly

2007-04-24 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-474: Attachment: fetcher2.patch Fetcher2 sets server-delay and blocking checks incorrectly

[jira] Created: (NUTCH-475) Adaptive crawl delay

2007-04-25 Thread JIRA
Adaptive crawl delay Key: NUTCH-475 URL: https://issues.apache.org/jira/browse/NUTCH-475 Project: Nutch Issue Type: Improvement Components: fetcher Reporter: Doğacan Güney Fix

[jira] Updated: (NUTCH-475) Adaptive crawl delay

2007-04-25 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-475?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-475: Attachment: adaptive-delay_draft.patch Patch with a simple adaptive algorithm. It measures the last

[jira] Updated: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-05-01 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-446: Attachment: crawl-delay_test.patch Test case for crawl delay rules. Nutch fails the test case

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-443: Attachment: NUTCH-443.08052007.patch Patch updated to latest trunk. allow parsers to return

[jira] Commented: (NUTCH-470) Adding optional terms to a query

2007-05-09 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494496 ] Ronny Næss commented on NUTCH-470: -- Hi, Trond. Optional meaning does that mean? I would like more Lucene based

[jira] Commented: (NUTCH-446) RobotRulesParser should ignore Crawl-delay values of other bots in robots.txt

2007-05-10 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494734 ] Doğacan Güney commented on NUTCH-446: - So, does anyone have objections to this? It fixes an annoying (albeit rare

[jira] Commented: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-05-11 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12494987 ] Doğacan Güney commented on NUTCH-444: - Hi Chris, Well I must say, with all the discussion that's gone on w.r.t

[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495350 ] Doğacan Güney commented on NUTCH-485: - You probably should not add put(String/Text key, Parse parse) methods

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-13 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495357 ] Doğacan Güney commented on NUTCH-443: - Well... That's embarrassing. It seems I forgot to include the necessary

[jira] Updated: (NUTCH-444) Possibly use a different library to parse RSS feed for improved performance and compatibility

2007-05-13 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-444?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-444: Attachment: NUTCH-444.patch feed.tar.bz2 First version of feed plugin featuring

[jira] Commented: (NUTCH-485) Change HtmlParseFilter 's to return ParseResult object instead of Parse object

2007-05-13 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495410 ] Doğacan Güney commented on NUTCH-485: - I have two more minor nits: 1) ParseResult.isSuccess returns true only

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-14 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495696 ] Doğacan Güney commented on NUTCH-443: - I am not sure I follow you Andrzej. My patch already does a very similar

[jira] Updated: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-05-14 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-443: Attachment: redirect_and_index_v2.patch New version. Moves parsing code into (content != null

[jira] Updated: (NUTCH-25) needs 'character encoding' detector

2007-05-21 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-25?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-25: --- Attachment: NUTCH-25_draft.patch Well, something like this should work... + Adds a new configurable

[jira] Commented: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters

2007-05-22 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12497770 ] Doğacan Güney commented on NUTCH-489: - This is obviously useful but: * Your patches both in this issue

[jira] Commented: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters

2007-05-23 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498113 ] Doğacan Güney commented on NUTCH-489: - Hmm.. Won't it now cause Nutch to filter on path on a line like

[jira] Created: (NUTCH-491) dedup fails with ArrayIndexOutOfBoundsException

2007-05-23 Thread JIRA
dedup fails with ArrayIndexOutOfBoundsException --- Key: NUTCH-491 URL: https://issues.apache.org/jira/browse/NUTCH-491 Project: Nutch Issue Type: Bug Affects Versions: 0.9.0

[jira] Commented: (NUTCH-491) dedup fails with ArrayIndexOutOfBoundsException

2007-05-24 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12498613 ] Doğacan Güney commented on NUTCH-491: - Can you retry with the latest trunk? I think a fix related to your issue

[jira] Created: (NUTCH-492) java.lang.OutOfMemoryError while indexing.

2007-05-26 Thread JIRA
java.lang.OutOfMemoryError while indexing. -- Key: NUTCH-492 URL: https://issues.apache.org/jira/browse/NUTCH-492 Project: Nutch Issue Type: Bug Components: indexer Affects Versions

[jira] Commented: (NUTCH-489) URLFilter-suffix management of the url path when the url contains some query parameters

2007-05-29 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12499777 ] Doğacan Güney commented on NUTCH-489: - Please ignore my last comment. I don't know what I was on when I wrote

[jira] Created: (NUTCH-494) FindBugs: CrawlDbReader and DeleteDuplicates

2007-05-31 Thread JIRA
FindBugs: CrawlDbReader and DeleteDuplicates Key: NUTCH-494 URL: https://issues.apache.org/jira/browse/NUTCH-494 Project: Nutch Issue Type: Bug Affects Versions: 1.0.0 Reporter

[jira] Updated: (NUTCH-494) FindBugs: CrawlDbReader and DeleteDuplicates

2007-05-31 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-494: Attachment: findbugs_.patch Patch for CrawlDbReader and DeleteDuplicates. FindBugs: CrawlDbReader

[jira] Created: (NUTCH-495) Unnecessary delays in Fetcher2

2007-05-31 Thread JIRA
Unnecessary delays in Fetcher2 -- Key: NUTCH-495 URL: https://issues.apache.org/jira/browse/NUTCH-495 Project: Nutch Issue Type: Bug Components: fetcher Affects Versions: 1.0.0 Reporter

[jira] Updated: (NUTCH-495) Unnecessary delays in Fetcher2

2007-05-31 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Doğacan Güney updated NUTCH-495: Attachment: fetcher2_robots.patch Unnecessary delays in Fetcher2

[jira] Commented: (NUTCH-466) Flexible segment format

2007-05-31 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500459 ] Doğacan Güney commented on NUTCH-466: - I skimmed through it and it looks awesome. I will try to test it better

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-01 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500603 ] Doğacan Güney commented on NUTCH-392: - From what I understand of MapFile.Writer code in hadoop, if you give

[jira] Commented: (NUTCH-392) OutputFormat implementations should pass on Progressable

2007-06-02 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12500935 ] Doğacan Güney commented on NUTCH-392: - Perhaps we can allow a user to configure this on a per-structure basis

[jira] Issue Comment Edited: (NUTCH-466) Flexible segment format

2007-06-06 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12501921 ] Doğacan Güney edited comment on NUTCH-466 at 6/6/07 6:08 AM: - I still haven't tested

[jira] Commented: (NUTCH-356) Plugin repository cache can lead to memory leak

2007-06-08 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12502846 ] Doğacan Güney commented on NUTCH-356: - This problem exists with nutch's latest version as evidenced

[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-15 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505197 ] Doğacan Güney commented on NUTCH-498: - Why can't we just set combiner class as LinkDb? AFAICS, you are not doing

[jira] Commented: (NUTCH-498) Use Combiner in LinkDb to increase speed of linkdb generation

2007-06-15 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-498?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505249 ] Doğacan Güney commented on NUTCH-498: - After examining the code better, I am a bit confused. We have

[jira] Commented: (NUTCH-443) allow parsers to return multiple Parse object, this will speed up the rss parser

2007-06-16 Thread JIRA
[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12505448 ] Doğacan Güney commented on NUTCH-443: - Chris, did you get a chance to look at this? If you are busy, I can assign
