[Nutch-dev] there is a danger with IndexSegment.java Re: [Nutch-cvs] nutch/src/java/net/nutch/indexer DeleteDuplicates.java,1.14,1.15 IndexMerger.java,1.6,1.7 IndexSegment.java,1.20,1.21

2004-10-04 Thread john
Hi Mike, there is a danger with the newest IndexSegment.java if the '-dir' option is accidentally given a directory like ./ In fact I just did $ ./bin/nutch index -local ./try//segments/20041001123721 -dir ./ and lost my work of two hours :-< FileUtil.fullyDelete(workingDir) is the culprit. Some saf
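The failure mode is that the directory handed to FileUtil.fullyDelete() ended up being the user's own working directory. As a minimal, illustrative guard (this is not the actual Nutch fix; the class below is hypothetical), the delete could refuse to touch any directory that is, or contains, the current working directory:

import java.io.File;
import java.io.IOException;

public class SafeDelete {
    public static void fullyDeleteGuarded(File workingDir) throws IOException {
        File target = workingDir.getCanonicalFile();
        File cwd = new File(".").getCanonicalFile();
        // If the current directory lives inside the target, deleting the
        // target would wipe out the user's data -- bail out instead.
        for (File f = cwd; f != null; f = f.getParentFile()) {
            if (f.equals(target)) {
                throw new IOException("Refusing to delete " + target
                        + ": it contains the current working directory");
            }
        }
        // FileUtil.fullyDelete(workingDir);  // safe to proceed here
    }
}

A dedicated, uniquely named temp directory for the indexer's working files would make the same mistake even harder to repeat.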

[Nutch-dev] Question: Nutch Based Meta-Crawlers

2004-10-04 Thread Yousef Ourabi
Hello, I have a quick question: does anyone have any experience or insight into the engineering difficulties of building a fast meta-crawler, as opposed to building an actual search engine? Could Nutch be used to pass on queries to the search engine of the day using the appropriate plug-in? What
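For what it's worth, the query-forwarding part of a meta-crawler can be sketched independently of Nutch's plug-in system. The interfaces below are purely hypothetical (they are not existing Nutch classes); they just illustrate fanning a query out to several back-end engines and merging the hits:

import java.util.ArrayList;
import java.util.List;

interface SearchBackend {
    // Returns result URLs for the query, however the wrapped engine is reached.
    List<String> search(String query) throws Exception;
}

public class MetaSearcher {
    private final List<SearchBackend> backends = new ArrayList<>();

    public void addBackend(SearchBackend b) { backends.add(b); }

    // Fan the query out to every backend and concatenate the results;
    // a slow or failing engine should not break the whole query.
    public List<String> search(String query) {
        List<String> merged = new ArrayList<>();
        for (SearchBackend b : backends) {
            try {
                merged.addAll(b.search(query));
            } catch (Exception ignored) {
            }
        }
        return merged;
    }
}

The harder parts are typically result normalisation, de-duplication, and re-ranking across engines rather than the dispatch itself.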

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Jason Boss
Hi Andrzej, I hate to be a pain. I patched my SegmentMergeTool.java. I also added the import line at the top. public long size = 0L; public SegmentReader(File dir) throws Exception { fetcherReader = new ArrayFile.Reader(new LocalFileSystem(), new File(dir, FetcherOutput.DIR_NAME

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Andrzej Bialecki
Jason Boss wrote: Andrzej, Does this patch resolve the EOF issues? Yes, I hope so... the exception is thrown when ArrayFile.Reader tries to read an entry from the 'data' file at a position taken from the 'index' file, but the seek to that position exceeds the actual length of the 'data' file...
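A generic sketch of the defensive check being described (not the actual patch, and not using Nutch's own file-system classes): before seeking to an offset taken from the 'index' file, verify that the entry does not extend past the end of the 'data' file, and treat such entries as end-of-data rather than letting the read fail:

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class TruncatedDataReader {
    // Returns the entry bytes, or null if the index points past the real
    // end of the data file (e.g. because a fetch died mid-write).
    public static byte[] readEntry(File dataFile, long offset, int length)
            throws IOException {
        try (RandomAccessFile data = new RandomAccessFile(dataFile, "r")) {
            if (offset < 0 || length < 0 || offset + length > data.length()) {
                return null;   // truncated segment: report EOF instead of throwing
            }
            byte[] buf = new byte[length];
            data.seek(offset);
            data.readFully(buf);
            return buf;
        }
    }
}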

[Nutch-dev] [ nutch-Bugs-1039516 ] Failure to Updatedb with NDFS

2004-10-04 Thread SourceForge.net
Bugs item #1039516, was opened at 2004-10-03 11:45 Message generated for change (Comment added) made by mike_cafarella You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1039516&group_id=59548 Category: ndfs Group: None Status: Open Resolution: None Priority

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Jason Boss
Andrzej, Does this patch resolve the EOF issues? Jason - Original Message - From: "Andrzej Bialecki" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Monday, October 04, 2004 1:45 PM Subject: Re: [Nutch-dev] mergesegs errors > Jason Boss wrote: > > > Doug and Andrzej, > > > > What d

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Andrzej Bialecki
Jason Boss wrote: Hi Doug, Just got done using the new nutch-nightly build. I FINALLY got through 4 segments, but I still have issues with segments that didn't "finish" normally, meaning that for some reason either the crawler died or I had to hit Ctrl-C during a crawl. How do you merge segments that have

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Jason Boss
Hi Doug, Just got done using the new nutch-nightly build. I FINALLY got through 4 segments, but I still have issues with segments that didn't "finish" normally, meaning that for some reason either the crawler died or I had to hit Ctrl-C during a crawl. How do you merge segments that have the bad EOF at th

Re: [Nutch-dev] releases

2004-10-04 Thread Andy Hedges
Doug wrote: That is also presently the case. Yes, sorry, I should have pointed that out :$

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Andrzej Bialecki
Jason Boss wrote: Doug and Andrzej, What do I need to do to get my local system working? Can I use the new version or do I still need to wait for a revised patch? Pls wait - Mike C. was doing some changes concurrently, and we need to resolve our versions. In the meantime you can use the attached

Re: [Nutch-dev] releases

2004-10-04 Thread Doug Cutting
Andy Hedges wrote: Maybe it goes without saying, but all the unit tests should pass. Of course. That is also presently the case. Doug

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Doug Cutting
Have you tried the new nightly that I just built? Doug Jason Boss wrote: Doug and Andrzej, What do I need to do to get my local system working? Can I use the new version or do I still need to wait for a revised patch? Thanks, Jason - Original Message - From: "Andrzej Bialecki" <[EMAIL PRO

Re: [Nutch-dev] releases

2004-10-04 Thread Andy Hedges
Doug Cutting wrote: Is there anything that folks feel we need before we can make a 1.0 release? I think Nutch is getting pretty usable, and I think a 1.0 release might encourage more folks to use it. Maybe it goes without saying, but all the unit tests should pass. You mentioned moving to subvers

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Jason Boss
Doug and Andrzej, What do I need to do to get my local system working? Can I use the new version or do I still need to wait for a revised patch? Thanks, Jason - Original Message - From: "Andrzej Bialecki" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Monday, October 04, 2004 11:00

[Nutch-dev] releases

2004-10-04 Thread Doug Cutting
There's a new "nightly" build in http://www.nutch.org/release/nightly/. Please give it a try and report any problems. I'd like to make a 0.6 release soon, and, if that goes well, make a 1.0 release. Is there anything that folks feel we need before we can make a 1.0 release? I think Nutch is

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Andrzej Bialecki
Doug Cutting wrote: Andrzej Bialecki wrote: Please be patient - I just got back from a trip, and I have a patch for this, but I need to test it first before committing - another couple of hours. Mike just submitted a revision of the NutchFS API & implementation that should fix this. Sorry, And

Re: [Nutch-dev] Indexing links to robots.txt blocked pages

2004-10-04 Thread Doug Cutting
Andrew Chen wrote: So one thing I want to do is: when a page is linked to but is blocked by a robots.txt, I want it to index the link to the page but not the actual text itself. Google does this - it just shows an entry like: foo.blah.com/ Similar pages. Right now in Nutch, the default behavio
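As a rough illustration of the behaviour Andrew asks for (the classes below are hypothetical, not Nutch's actual indexer API): when robots.txt forbids fetching a page, the index entry is built from the URL and the anchor text of inlinks only, never from the page body:

import java.util.List;

class IndexEntry {
    String url;
    String anchorText;   // text of the links pointing at this page
    String body;         // null when the page could not be fetched
}

public class BlockedPageIndexer {
    // robotsAllowed and anchors would come from the crawl db / link db.
    public static IndexEntry makeEntry(String url, boolean robotsAllowed,
                                       List<String> anchors, String fetchedBody) {
        IndexEntry e = new IndexEntry();
        e.url = url;
        e.anchorText = String.join(" ", anchors);
        // Only index the page body if we were actually allowed to fetch it.
        e.body = robotsAllowed ? fetchedBody : null;
        return e;
    }
}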

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Doug Cutting
Andrzej Bialecki wrote: Please be patient - I just got back from a trip, and I have a patch for this, but I need to test it first before committing - another couple of hours. Mike just submitted a revision of the NutchFS API & implementation that should fix this. Sorry, Andrzej, if this replica

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Jason Boss
Thank goodness...I thought I was going nuts on this. Hopefully this patch will fix all of my weekend headaches. Thanks! Jason - Original Message - From: "Andrzej Bialecki" <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> Sent: Monday, October 04, 2004 3:53 AM Subject: Re: [Nutch-dev] merges

Re: [Nutch-dev] mergesegs errors

2004-10-04 Thread Andrzej Bialecki
Jason Boss wrote: Hi, How do I go about troubleshooting this? Please be patient - I just got back from a trip, and I have a patch for this, but I need to test it first before committing - another couple of hours. -- Best regards, Andrzej Bialecki

[Nutch-dev] Help

2004-10-04 Thread Azam Jalali
I’m an MSc student. My thesis is on retrieved-document clustering. I have a large collection, and I want to use free code for a search method to retrieve documents. I would prefer a vector-based model with tf-idf weighting. Do you apply standard methods for search? Which protocol do you employ? Can I use yo
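Since the question mentions tf-idf specifically, here is a minimal, self-contained sketch of the standard weighting w(t, d) = tf(t, d) * log(N / df(t)); it is not Nutch code, just an illustration of the method:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdf {
    // docs: each document as a list of tokens. Returns one term-weight map per document.
    public static List<Map<String, Double>> weight(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();          // document frequency of each term
        for (List<String> doc : docs) {
            for (String t : new HashSet<>(doc)) {
                df.merge(t, 1, Integer::sum);
            }
        }
        int n = docs.size();
        List<Map<String, Double>> weights = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> tf = new HashMap<>();      // raw term frequency in this doc
            for (String t : doc) {
                tf.merge(t, 1, Integer::sum);
            }
            Map<String, Double> w = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                w.put(e.getKey(), e.getValue() * Math.log((double) n / df.get(e.getKey())));
            }
            weights.add(w);
        }
        return weights;
    }
}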

Re: [Nutch-dev] Re: Indexing links to robots.txt blocked pages

2004-10-04 Thread Dawid Weiss
Interesting, really interesting. I didn't mean to argue about the philosophy; I'm just stunned that they do index pages blocked by robots.txt; I think it should not be the case. Maybe it's a mix of both our points -- they index links to all pages (as I suggested based on my gossiped what's-behind-go

Re: [Nutch-dev] Re: Indexing links to robots.txt blocked pages

2004-10-04 Thread Andrew Chen
Dawid, Thanks for your e-mail. I hadn't thought of it that way - it still strikes me that one could make an argument that any content on open pages is fair game, even if it is protected by robots.txt somewhere else. But instead of arguing philosophy, let me give you a Google example, using a rath