[Nutch-dev] (no subject)

2005-02-28 Thread Egor Chernodarov
Hello! Monday, February 28, 2005, 1:53:07 AM, you wrote: Y> 2. How to get the number of pages in DB? try "bin/nutch readdb dbpath -stats" -- Best regards, mailto:[EMAIL PROTECTED] Chernodarov Egor --

Re: [Nutch-dev] Crawl did not finish

2005-02-28 Thread Olaf Thiele
Hi Paul, the general list would indeed be more appropriate, but let's finish the thread here. That makes it easier for everyone. It sounds like you have not built your index yet. I recommend following the tutorial steps closely if you are a first-time user. After you built the index, you

[Nutch-dev] CVS to SVN at Apache

2005-02-28 Thread Doug Cutting
It turns out that Apache won't import our CVS history into SVN. So, unless someone argues against it, tomorrow I will: - disable commits to Nutch's Sourceforge CVS repository - rename all packages from net.nutch to org.apache.nutch - change the license to Apache - commit Nutch's sources to A

Re: [Nutch-dev] Crawl did not finish

2005-02-28 Thread sub paul
Hi Olaf, Thanks for the reply. I am trying the same crawl with depth 5, and it has been going on for about 3 hours, so that's a good sign. I read on the Nutch website somewhere that even if the crawl is not finished, you can use the results from whatever it has. I tried doing just that, and all

Re: [Nutch-dev] Re: [Nutch-general] Question about normalizing urls

2005-02-28 Thread Olaf Thiele
Hi Shri, if you look at regex-normalize.xml you will notice that jsession ids are not taken out as a default. Your session id seems to be quite long. If you are sure of the format, you may try inserting the following rule (;)jsessionid=[a-zA-Z0-9!-]{64}(?) Are all links of the same style or
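Olaf's rule is meant for regex-normalize.xml, but the intended effect can be sketched with plain java.util.regex. This is an illustration, not the actual Nutch normalizer: the 64-character length and the allowed character set are assumptions to adapt to the session ids you actually see.

```java
import java.util.regex.Pattern;

public class JsessionidStrip {
    // Assumed format: ";jsessionid=" followed by exactly 64 characters
    // drawn from letters, digits, '!' and '-'. Adjust to your URLs.
    private static final Pattern JSESSIONID =
            Pattern.compile(";jsessionid=[a-zA-Z0-9!-]{64}");

    public static String normalize(String url) {
        return JSESSIONID.matcher(url).replaceAll("");
    }

    public static void main(String[] args) {
        String raw = "http://example.com/page;jsessionid="
                + "A".repeat(64) + "?x=1";
        System.out.println(normalize(raw));
        // http://example.com/page?x=1
    }
}
```

In Nutch itself the equivalent pattern would go into a rule in conf/regex-normalize.xml with an empty substitution, so the normalizer strips the id before the URL enters the db.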

Re: [Nutch-dev] search suggestion

2005-02-28 Thread Olaf Thiele
Hi Will, I agree with you that this is an intriguing idea, but 1. Look at http://www.eurekster.com/ 2. Privacy Issues (see gmail) 3. Too many sites If you feel nevertheless inclined towards programming this, go ahead. I will be one of the first to test your new interface and give feedback. Kind

Re: [Nutch-dev] Crawl did not finish

2005-02-28 Thread Olaf Thiele
Hi Paul, the most likely problem seems to me the depth of 15. If your first page and every consecutive one had 10 links, your crawler would have to fetch roughly 24414062500 gigabytes from the Internet. Depending on your data, start with a much smaller depth. Kind regards, Olaf On Mon, 28 Feb
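The scale of the problem is easy to check: with a fan-out of 10 links per page, the page count grows by a factor of ten per depth level. A quick back-of-the-envelope sketch (the fan-out of 10 is from the thread; the 25 KB average page size is my assumption, the thread does not state what figure Olaf used):

```java
public class CrawlEstimate {
    public static void main(String[] args) {
        final int depth = 15;   // crawl depth from the thread
        final long fanout = 10; // links per page, as in Olaf's example

        long pages = 0;
        long level = 1;         // one seed page at depth 0
        for (int d = 0; d <= depth; d++) {
            pages += level;
            level *= fanout;
        }
        System.out.println("pages to fetch: " + pages);
        // pages to fetch: 1111111111111111 (about 1.1e15)

        double avgPageBytes = 25_000; // assumption, not from the thread
        double gigabytes = pages * avgPageBytes / 1e9;
        System.out.printf("approx. download: %.1e GB%n", gigabytes);
    }
}
```

Even at a few kilobytes per page the total runs into tens of billions of gigabytes, which is why dropping the depth to something small is the usual first fix for an intranet crawl.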

Re: [Nutch-dev] inject breaks with "java.lang.StackOverflowError"

2005-02-28 Thread Doug Cutting
Sorry for the inconvenience. I just fixed this bug in CVS, although it may take it a few hours to reach the public CVS. Look for version 1.7 at: http://cvs.sourceforge.net/viewcvs.py/nutch/nutch/src/java/net/nutch/io/WritableComparator.java Sorry. My bad. Doug -

[Nutch-dev] [ nutch-Bugs-1152281 ] WritableComparator goes to never-ending loop

2005-02-28 Thread SourceForge.net
Bugs item #1152281, was opened at 2005-02-26 02:32 Message generated for change (Comment added) made by cutting You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1152281&group_id=59548 Category: None Group: None >Status: Closed >Resolution: Fixed Priority:

[Nutch-dev] Crawl did not finish

2005-02-28 Thread sub paul
Hello All, I was running an intranet crawl and it seems like it did not finish properly. It is a pretty default setup, but the crawl's depth was 15, and I had turned on queries by commenting out # skip URLs containing certain characters as probable queries, etc. [EMAIL PROTECTED] other than bunch o

SOLVED (Re: [Nutch-dev] Problems: mergesegs & segread breaks)

2005-02-28 Thread Andrzej Bialecki
Matthias Jaekle wrote: Thanks Andrzej! Matthias Andrzej Bialecki wrote: Matthias Jaekle wrote: Hi Andrzej, I have copied the 2 segments to: http://www.iventax.de/NEW/ You might wget -r them. Both of them are around 2 GB. If I should tar them or something else, please let me know. Thanks for your

[Nutch-dev] "It feels so good!" over and over! (spam)

2005-02-28 Thread 初々しく淡いオマ○コ
[EMAIL PROTECTED] ¡¡-™ [EMAIL PROTECTED]@[EMAIL PROTECTED]@‘f‘«Eƒpƒ“ƒXƒgEƒnƒCƒq[ƒ‹ [EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@[EMAIL PROTECTED]@`‚ß‚­‚é‚ß‚­ƒtƒFƒ`‰æ‘œE“®‰æ‚Ì•óŒÉ`--

[Nutch-dev] Indexing performance?

2005-02-28 Thread Matthias Jaekle
Hi, I tried to index a segment with around 5,000,000 URLs. This is running at 0.31 rec/s, which seems to be very slow. When I was running indexing on smaller segments (500,000 docs) it seemed to be much faster. Are there any explanations? I use a 1.4 GHz Athlon with 1 GB RAM. What is a good segment
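To put the rate in perspective: at 0.31 records per second, a 5,000,000-document segment would take roughly half a year to index. A quick sanity check, plain arithmetic with the numbers from the post, no Nutch internals assumed:

```java
public class IndexEta {
    public static void main(String[] args) {
        long docs = 5_000_000;    // segment size from the post
        double recPerSec = 0.31;  // observed indexing rate

        double seconds = docs / recPerSec;
        double days = seconds / 86_400; // seconds per day
        System.out.printf("estimated indexing time: %.0f days%n", days);
        // estimated indexing time: 187 days
    }
}
```

If the 500,000-doc segments really indexed much faster per record, the slowdown is super-linear in segment size, which would point at memory pressure or merge behavior on the 1 GB machine rather than raw indexing speed; that is a guess, not something the thread confirms.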
