[Nutch-dev] fetcher failling on urlnormalizer

2005-04-13 Thread Byron Miller
I created 100 fetchlists from a 50-million-URL db, and when I try to run fetch I get a few fetches done and then tons of errors from the URL normalizer. Anyone else seeing this? 050414 014446 fetching http://www.theastonline.com/ 050414 014446 fetching http://authoryellowpages.com/featureslist.asp

[Nutch-dev] Crawl-urlfilter cann't deals with relative urls appropriately ??

2005-04-13 Thread cao yuzhong
I just want to fetch all the pages in http://news.buaa.edu.cn So I modified my crawl-urlfilter.txt like this: # # skip file:, ftp:, & mailto: urls -^(file|ftp|mailto|https): # skip image and other suffixes we can't yet parse -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xl

Re: [Nutch-dev] Re: nutch engines

2005-04-13 Thread Zhou LiBing
Thank you On 4/14/05, Doug Cutting <[EMAIL PROTECTED]> wrote: > > Stefan Groschupf wrote: > > Some weeks ago I was starting to write a small tool to be able to compare > > results via command line. > > However I never finished the work, but if you like I can send you > > sources but there is still

[Nutch-dev] [jira] Updated: (NUTCH-35) modify XML parsing code in Nutch to use single API

2005-04-13 Thread Stefan Grroschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-35?page=history ] Stefan Grroschupf updated NUTCH-35: --- Attachment: xmlApiPatchIII.patch It's a shame, however I'm sure one day there will be a patch from me that just need to be assigned - I hope. :-) The p

Re: [Nutch-dev] Feature request - pluggable Analyzer

2005-04-13 Thread David Wallace
OK Jack, but the details of my analyser aren't particularly exciting. I need to index a site that has a mixture of documents in English and Te Reo Maori (indigenous language of New Zealand). Vowels in Te Reo Maori are sometimes written with short overlines (also known as macrons), to indicate a
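
The preview above is cut off before the analyser itself is described, so the following is only a minimal sketch of one common way to handle macrons at analysis time, not the poster's implementation: fold macronised vowels to their plain forms so that a query for "Maori" also matches text written with the macron. The class name `MacronFolder` and the method `fold` are hypothetical; the only library call used is the standard `java.text.Normalizer`.

```java
import java.text.Normalizer;

/** Hypothetical helper: strips macrons (and nothing else) from a string. */
public class MacronFolder {
    // U+0304 COMBINING MACRON, which is what a macron becomes after NFD decomposition.
    private static final String COMBINING_MACRON = "\u0304";

    public static String fold(String text) {
        // Decompose so that a-with-macron (U+0101) becomes 'a' followed by U+0304.
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // Drop only the macron; other diacritics are left in place.
        String stripped = decomposed.replace(COMBINING_MACRON, "");
        // Recompose whatever diacritics remain (for example e-acute).
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        System.out.println(fold("Te Reo M\u0101ori"));
    }
}
```

In a pluggable-analyzer design, a fold like this would typically be applied to both indexed tokens and query terms so the two sides agree.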

[Nutch-dev] Re: action apis (NUTCH-27)

2005-04-13 Thread Andrzej Bialecki
Jérôme Charron wrote: Using this model is important also from another point of view: with the current code, where NutchConf is a singleton, it's not possible to run several tasks in parallel within a single JVM, but with radically different parameters. E.g.: if you want to run several CrawlTool wit

[Nutch-dev] Wiki Up!

2005-04-13 Thread Chirag Chaman
Folks, The new wiki is up and running. Basically this means that all important pages have been moved over and the FrontPage is pretty much the same as the old one. A few links on the FrontPage were broken or obsolete, and those I did not move over. Given that no one said otherwise, I would say tha

[Nutch-dev] [jira] Created: (NUTCH-40) TestSegmentMergeTool fail

2005-04-13 Thread Stefan Grroschupf (JIRA)
TestSegmentMergeTool fail - Key: NUTCH-40 URL: http://issues.apache.org/jira/browse/NUTCH-40 Project: Nutch Type: Bug Reporter: Stefan Grroschupf Assigned to: Andrzej Bialecki Priority: Trivial ant clean && ant test ... 050413 22

[Nutch-dev] Re: filesystem indexing

2005-04-13 Thread Stefan Groschupf
You need to modify the jsp page in any case. What you can do as well, is to write a custom index filter plugin that adds another meta data (your link) to the document in the index. However you need to edit the jsp to show your link instead of the default url. Stefan Am 13.04.2005 um 23:16 schrie

[Nutch-dev] filesystem indexing

2005-04-13 Thread Boris Kröger
Hi all, sorry to ask the same question on the user mailing list, but I didn't get any answer to my problem. I have a filesystem with files to index. -> no problem to index the files. I want to search them remote via the WAR using Tomcat. -> no problem by moving the segments to the correct positio

[Nutch-dev] retrieving Websites using docId

2005-04-13 Thread Siva Bandhamravuri
Dear Nutch developers, Using the IndexReader, I am able to read the segments and obtain term frequencies of documents (using their ids). Now I want to actually retrieve the document data, such as the URL, title, and content of the document, using the document ids. 1: How can I retrieve the d

[Nutch-dev] Re: action apis (NUTCH-27)

2005-04-13 Thread Sami Siren
Jérôme Charron wrote: Using this model is important also from another point of view: with the current code, where NutchConf is a singleton, it's not possible to run several tasks in parallel within a single JVM, but with radically different parameters. E.g.: if you want to run several CrawlTool wit

[Nutch-dev] Re: resolve or close bugs?

2005-04-13 Thread Jérôme Charron
> Since we have no QA and no test group yet, what you think about we > close bugs when releasing the next version? > Make that sense? I will take care of this process but need a rule to > follow. Here is a proposal: Since there's no QA team, why not proceed as follows: 1. mark resolved once a

[Nutch-dev] Re: action apis (NUTCH-27)

2005-04-13 Thread Jérôme Charron
> Using this model is important also from another point of view: with the > current code, where NutchConf is a singleton, it's not possible to run > several tasks in parallel within a single JVM, but with radically > different parameters. E.g.: if you want to run several CrawlTool with > different
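
The point quoted above, that a singleton configuration prevents running differently parameterised tasks in one JVM, can be illustrated with a small sketch. This is not Nutch code: `TaskConf`, `CrawlTask`, and the property key `crawl.depth` are hypothetical names chosen for illustration. Each task receives its own configuration object instead of reading a global one.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerTaskConfSketch {

    /** Hypothetical per-task configuration: an instance, not a singleton. */
    static final class TaskConf {
        private final Map<String, String> props = new HashMap<>();

        TaskConf set(String key, String value) {
            props.put(key, value);
            return this;
        }

        int getInt(String key, int defaultValue) {
            String value = props.get(key);
            return value == null ? defaultValue : Integer.parseInt(value);
        }
    }

    /** Hypothetical task that only reads the configuration it was given. */
    static final class CrawlTask implements Callable<String> {
        private final String name;
        private final TaskConf conf;

        CrawlTask(String name, TaskConf conf) {
            this.name = name;
            this.conf = conf;
        }

        @Override
        public String call() {
            return name + " depth=" + conf.getInt("crawl.depth", 1);
        }
    }

    /** Runs all tasks in parallel in this JVM; results come back in task order. */
    static List<String> runAll(List<CrawlTask> tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        try {
            List<String> results = new ArrayList<>();
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<CrawlTask> tasks = new ArrayList<>();
        tasks.add(new CrawlTask("shallow", new TaskConf().set("crawl.depth", "3")));
        tasks.add(new CrawlTask("deep", new TaskConf().set("crawl.depth", "10")));
        runAll(tasks).forEach(System.out::println);
    }
}
```

With a global singleton, the two tasks would have to share one `crawl.depth` value; passing the configuration in makes the difference explicit.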

[Nutch-dev] [jira] Commented: (NUTCH-35) modify XML parsing code in Nutch to use single API

2005-04-13 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-35?page=comments#action_62762 ] Doug Cutting commented on NUTCH-35: --- TestFetcher is still failing for me with this patch: fetch of http://sourceforge.net/projects/nutch/ failed with: java.lang.NoClassDefF

[Nutch-dev] RE: MapFile.Reader bug (Re: Optimal segment size?)

2005-04-13 Thread Jay Yu
Andrzej: Thank you for your response to my comments. The reason I said there may be a bug in the fetcher is that in our case there was no JVM crash or OOM Exception during the fetch and the fetch process was successful by reading the log file. So I cannot tell what caused the truncation (Unexpected

[Nutch-dev] [jira] Resolved: (NUTCH-5) Hit limiter off-by-one bug

2005-04-13 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-5?page=history ] Doug Cutting resolved NUTCH-5: -- Resolution: Fixed I have applied this patch. Thanks, Andy! > Hit limiter off-by-one bug > -- > > Key: NUTCH-5 > UR

[Nutch-dev] MapFile.Reader bug (Re: Optimal segment size?)

2005-04-13 Thread Andrzej Bialecki
Jay Yu wrote: I have a similar problem when the segread tool (actually any code that needs to read the seg) was just hanging there forever on a truncated segment. I think there are at least 2 bugs: one in the fetcher which generated the truncated seg without any error message, the 2nd is the Well

[Nutch-dev] [jira] Closed: (NUTCH-5) Hit limiter off-by-one bug

2005-04-13 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-5?page=history ] Doug Cutting closed NUTCH-5: > Hit limiter off-by-one bug > -- > > Key: NUTCH-5 > URL: http://issues.apache.org/jira/browse/NUTCH-5 > Project: Nutch

[Nutch-dev] Re: nutch engines

2005-04-13 Thread Doug Cutting
Stefan Groschupf wrote: Some weeks ago I was starting to write a small tool to be able to compare results via command line. However I never finished the work, but if you like I can send you sources but there is still some work to do. Mike wrote code to do this a while back. It was difficult to up

[Nutch-dev] RE: Optimal segment size?

2005-04-13 Thread Jay Yu
I have a similar problem when the segread tool (actually any code that needs to read the seg) was just hanging there forever on a truncated segment. I think there are at least 2 bugs: one in the fetcher which generated the truncated seg without any error message, the 2nd is the MapFile/SequenceF

Re: [Nutch-dev] resolve or close bugs?

2005-04-13 Thread ogjunk-nutch
An issue is typically marked as Resolved after the patch is applied, unit tests passed, and the modified code is committed to the repository. In enterprise environments the issue is typically closed after it's been verified by the QA. In case of an open-source project there may be no need for sep

Re: AW: [Nutch-dev] Re: tools cleanup

2005-04-13 Thread Stefan Groschupf
Stephan, I already started some tests on using cli2. CLI v. 1 is in my opinion not supporting all required parameters. Can you please be more specific? I defined an interface "Tool" and created an AbstractTool class. Currently I started to change the existing tools to be extended from them. May be

[Nutch-dev] Re: resolve or close bugs?

2005-04-13 Thread Stefan Groschupf
Sure! While work is under way we mark the issue as in progress, and when the patch is committed we mark it as resolved. So the only thing to discuss is when we should close a bug. Since we have no QA and no test group yet, what do you think about closing bugs when releasing the next version? Does that make sense?

[Nutch-dev] Re: Optimal segment size?

2005-04-13 Thread Andy Liu
I've had problems trying to access truncated segments in the past. The process would hang when I tried to read the segment. Have you tried using the segread tool to see if it can be accessed correctly? Have you tried repairing the segment? One week for 4 million records is way long, so I would s

[Nutch-dev] Re: Optimal segment size?

2005-04-13 Thread Piotr Kosiorowski
Hello Luke, Have you changed default values of parameters related to indexing? It helped in my case - Yesterday I was indexing ~3.5mln pages segment and it took 3.5h and optimization took 10 minutes. I am using linux (ext3) on AMD Opteron 2.2GHz +SCSI drives. I am using (probably not the best value

[Nutch-dev] Re: tools cleanup

2005-04-13 Thread Stefan Groschupf
Hi, Doug, can you or someone else please commit the classes you suggested? I think most / all agree and we can start porting things, but if everyone now creates their own NutchConfigurable interfaces we will run into trouble, and people are unhappy when they need to correct patches they submitted or pa

[Nutch-dev] Optimal segment size?

2005-04-13 Thread Luke Baker
Hey, Is there some sort of optimal or maximum segment size? I have a segment with 3.9 million records and it appears to be taking a really long time to index. The index process has been optimizing the index for over a week. The server I'm running it on is a dual Xeon 3.0 GHz with 2GB of RAM.

[Nutch-dev] [jira] Updated: (NUTCH-5) Hit limiter off-by-one bug

2005-04-13 Thread Andy Liu (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-5?page=history ] Andy Liu updated NUTCH-5: - Attachment: fix-hitlimiting.patch Patches NutchBean to fix hit limiting off-by-one issue. > Hit limiter off-by-one bug > -- > > Key: NUTCH-

[Nutch-dev] Re: WebDBInjector and DMOZ separation

2005-04-13 Thread Doug Cutting
Please submit a patch. To construct a patch, do something like:
  ant test      # check that there are no failures
  ant clean
  svn add src/java/org/apache/nutch/myPackage/MyClass.java
  svn status    # make sure that you've added all new files
  svn diff > my.patch
Doug
David Spencer wrote: At a glance it seems th

[Nutch-dev] WebDBInjector and DMOZ separation

2005-04-13 Thread David Spencer
At a glance it seems that org.apache.nutch.db.WebDBInjector should (or could) have the DMOZ code taken out of it and put somewhere else, as the DMOZ code is really just a use of WebDBInjector and not essential to it and in theory there could be lots of different injectors (e.g. URLs from a DB..

[Nutch-dev] Re: Why Crawl failed to fetch so many pages?

2005-04-13 Thread Jack Tang
try +^http://news.buaa.edu.cn/* On 4/13/05, cao yuzhong <[EMAIL PROTECTED]> wrote: > > I want crawl all the pages in http://news.buaa.edu.cn > > Following is my crawl-urlfilter.txt: > > # skip file:, ftp:, & mailto: urls > -^(file|ftp|mailto|https): > > # skip image and other suffixes we can
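
To see what the suggested rule does alongside the two skip rules quoted in this thread, here is a minimal sketch of a sign-prefixed regex filter. It is not the Nutch filter class: `UrlFilterSketch` is a hypothetical name, and the sketch assumes the semantics described in the comments of the stock crawl-urlfilter.txt (rules are tried in order, the first pattern found in the URL decides, and a URL matching no rule is ignored). Note that in a regex, `/*` means "zero or more slashes", so the suggested rule behaves as a prefix match on `http://news.buaa.edu.cn`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a first-match-wins, sign-prefixed regex URL filter. */
public class UrlFilterSketch {

    /** One rule: '+' accepts and '-' rejects when the regex is found in the URL. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    public UrlFilterSketch(String... lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // blank lines and comments carry no rule
            }
            char sign = trimmed.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(trimmed.substring(1))));
        }
    }

    /** Returns the verdict of the first matching rule; no match means rejected. */
    public boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    /** The two skip rules quoted in the thread plus the suggested accept rule. */
    public static UrlFilterSketch threadExample() {
        return new UrlFilterSketch(
            "# skip file:, ftp:, & mailto: urls",
            "-^(file|ftp|mailto|https):",
            "# skip image and other suffixes we can't yet parse",
            "-\\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$",
            "+^http://news.buaa.edu.cn/*");
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = threadExample();
        System.out.println(filter.accepts("http://news.buaa.edu.cn/index.html"));
        System.out.println(filter.accepts("http://news.buaa.edu.cn/logo.gif"));
        System.out.println(filter.accepts("http://www.example.com/"));
    }
}
```

Because the first match wins, rule order matters: a suffix skip rule placed before the accept rule rejects images on the target host even though the accept rule would also match them.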

[Nutch-dev] Re: How to do OR search in Nutch?

2005-04-13 Thread Andy Liu
You would need to make a custom query filter plugin. You'll want to look at the query-basic plugin as an example of how it constructs a Lucene query from a Nutch query. On Apr 12, 2005 12:34 AM, zhang jin <[EMAIL PROTECTED]> wrote: > If I want to support Or.How I should do? > Thanks very much! >

[Nutch-dev] Re: resolve or close bugs?

2005-04-13 Thread Doug Cutting
Stefan Groschupf wrote: I personally understand the life cycle of an issue like this:
- Create an issue.
- Assign the issue to a developer (optional).
- Resolve the issue as soon as someone starts to work on it.
- Close the issue as soon as the patch is in the sources.
I'm used to not resolving them until

[Nutch-dev] [jira] Commented: (NUTCH-35) modify XML parsing code in Nutch to use single API

2005-04-13 Thread Doug Cutting (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-35?page=comments#action_62673 ] Doug Cutting commented on NUTCH-35: --- Three unit tests fail after I apply this patch: [junit] Test org.apache.nutch.analysis.TestQueryParser FAILED [junit] Test org.a

[Nutch-dev] [jira] Commented: (NUTCH-35) modify XML parsing code in Nutch to use single API

2005-04-13 Thread Stefan Grroschupf (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-35?page=comments#action_62679 ] Stefan Grroschupf commented on NUTCH-35: Strange, I focused on the pluginsystem test case that by the way only works in case the include pattern allows all plugins. Ho

[Nutch-dev] Why Crawl failed to fetch so many pages?

2005-04-13 Thread cao yuzhong
I want to crawl all the pages in http://news.buaa.edu.cn Following is my crawl-urlfilter.txt: # skip file:, ftp:, & mailto: urls -^(file|ftp|mailto|https): # skip image and other suffixes we can't yet parse -\.(gif|GIF|jpg|JPG|ico|ICO|css|sit|eps|wmf|rtf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe)$ # skip

[Nutch-dev] Re: action apis (NUTCH-27)

2005-04-13 Thread Andrzej Bialecki
Stefan Groschupf wrote: Hi developers, just a comment about the planned porting of tools to actions. Related to: http://issues.apache.org/jira/browse/NUTCH-27 (Patch to get a status of running Fetcher) I suggest that all actions have an API to query status information. This will be very helpful for th