Hi, Mike,
There is a danger with the newest IndexSegment.java if the '-dir' option is
accidentally given a directory like ./
In fact I just did
$ ./bin/nutch index -local ./try//segments/20041001123721 -dir ./
and lost two hours of work :-<
FileUtil.fullyDelete(workingDir) is the culprit.
Some saf
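(Untested sketch of the kind of safety check that would help here; the class
and method names below are mine, not the actual IndexSegment code. The idea
is simply to refuse to fully delete any directory that is, or contains, the
directory the tool was started from, so an accidental '-dir ./' can no longer
wipe the current tree before FileUtil.fullyDelete(workingDir) runs.)

import java.io.File;
import java.io.IOException;

public class DeleteGuard {
  /** Throws instead of allowing a delete of a directory that is,
   *  or contains, the current working directory. */
  public static void checkSafeToDelete(File workingDir) throws IOException {
    File target = workingDir.getCanonicalFile();
    File cwd = new File(".").getCanonicalFile();
    String targetPrefix = target.getPath() + File.separator;
    if (target.equals(cwd)
        || (cwd.getPath() + File.separator).startsWith(targetPrefix)) {
      throw new IOException("Refusing to delete " + target
          + ": it is, or contains, the current directory");
    }
  }
}

IndexSegment could then call checkSafeToDelete(workingDir) right before the
FileUtil.fullyDelete(workingDir) line.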
Hello,
I have a quick question: does anyone have any
experience or insight into the engineering
difficulties of building a fast meta-crawler, as
opposed to building an actual search engine? Could
Nutch be used to pass on queries to the search engine
of the day using the appropriate plug-in? What
Hi Andrzej,
I hate to be a pain.
I patched my SegmentMergeTool.java.
I also added the import line at the top.
public long size = 0L;
public SegmentReader(File dir) throws Exception {
  fetcherReader = new ArrayFile.Reader(new LocalFileSystem(),
      new File(dir, FetcherOutput.DIR_NAME
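(The added import isn't shown because the message is cut off; given the
new LocalFileSystem() call it is presumably the local file system class.
The package name below is my guess for the 2004 tree, not taken from the
message:)

import net.nutch.fs.LocalFileSystem;  // assumed package; verify against your source tree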
Jason Boss wrote:
Andrzej,
Does this patch resolve the EOF issues?
Yes, I hope so... the exception is thrown when the ArrayFile.Reader
tries to read an entry from the 'data' file at the position recorded in
the 'index' file, but that seek position exceeds the actual length of
the 'data' file...
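(In other words, a fetch that died part-way can leave a 'data' file shorter
than what its 'index' promises. Purely as an illustration of the kind of
bounds check involved -- this is not the actual patch, and it uses plain
java.io rather than the real ArrayFile.Reader API:)

import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;

public class TruncationSafeReader {
  private final RandomAccessFile data;
  private final long dataLength;

  public TruncationSafeReader(File dataFile) throws IOException {
    this.data = new RandomAccessFile(dataFile, "r");
    this.dataLength = dataFile.length();
  }

  /** Returns null for entries whose recorded offset lies past the end of
   *  the (possibly truncated) data file, instead of throwing EOFException. */
  public byte[] readAt(long offset, int length) throws IOException {
    if (offset < 0 || offset + length > dataLength) {
      return null;  // the index points past what was actually written
    }
    data.seek(offset);
    byte[] buf = new byte[length];
    data.readFully(buf);
    return buf;
  }
}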
Bugs item #1039516, was opened at 2004-10-03 11:45
Message generated for change (Comment added) made by mike_cafarella
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=491356&aid=1039516&group_id=59548
Category: ndfs
Group: None
Status: Open
Resolution: None
Priority
Andrzej,
Does this patch resolve the EOF issues?
Jason
- Original Message -
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, October 04, 2004 1:45 PM
Subject: Re: [Nutch-dev] mergesegs errors
> Jason Boss wrote:
>
> > Doug and Andrzej,
> >
> > What d
Jason Boss wrote:
Hi Doug,
Just got done using the new nutch-nightly build. I FINALLY got through 4
segments, but still have issues with segments that didn't "finish" normally,
meaning that for some reason either the crawler died or I had to Ctrl-C a
crawl. How do you merge segments that have
Hi Doug,
Just got done using the new nutch-nightly build. I FINALLY got through 4
segments, but still have issues with segments that didn't "finish" normally,
meaning that for some reason either the crawler died or I had to Ctrl-C a
crawl. How do you merge segments that have the bad EOF at th
Doug wrote:
That is also presently the case.
Yes, sorry I should have pointed that out :$
Jason Boss wrote:
Doug and Andrzej,
What do I need to do to get my local system working? Can I use the new
version or do I still need to wait for a revised patch?
Please wait - Mike C. was making some changes concurrently, and we need to
resolve our versions. In the meantime you can use the attached
Andy Hedges wrote:
Maybe it goes without saying, but all the unit tests should pass.
Of course. That is also presently the case.
Doug
Have you tried the new nightly that I just built?
Doug
Jason Boss wrote:
Doug and Andrzej,
What do I need to do to get my local system working? Can I use the new
version or do I still need to wait for a revised patch?
Thanks,
Jason
- Original Message -
From: "Andrzej Bialecki" <[EMAIL PRO
Doug Cutting wrote:
Is there anything that folks feel we need before we can make a 1.0
release? I think Nutch is getting pretty usable, and think a 1.0
release might encourage more folks to use it.
Maybe it goes without saying, but all the unit tests should pass. You
mentioned moving to subvers
Doug and Andrzej,
What do I need to do to get my local system working? Can I use the new
version or do I still need to wait for a revised patch?
Thanks,
Jason
- Original Message -
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, October 04, 2004 11:00
There's a new "nightly" build in http://www.nutch.org/release/nightly/.
Please give it a try and report any problems. I'd like to make a 0.6
release soon, and, if that goes well, make a 1.0 release.
Is there anything that folks feel we need before we can make a 1.0
release? I think Nutch is
Doug Cutting wrote:
Andrzej Bialecki wrote:
Please be patient - I just got back from a trip, and I have a patch
for this, but I need to test it first before committing - another
couple of hours.
Mike just submitted a revision of the NutchFS API & implementation that
should fix this. Sorry, And
Andrew Chen wrote:
So one thing I want to do: when a page is linked to, but is blocked by a
robots.txt, I want Nutch to index the link to the page but not the actual
text itself. Google does this - it just shows up as an entry:
foo.blah.com/
Similar pages
Right now in Nutch, the default behavio
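(Purely as an illustration of the behaviour Andrew describes, not Nutch's
actual indexing path: when robots.txt blocks a page, the entry would be
built only from information visible on other pages -- the URL itself and
the anchor text of inbound links -- never from the blocked content. The
class and method names below are made up:)

import java.util.List;

public class BlockedPageIndexer {
  /** Searchable text for a robots.txt-blocked URL: the URL plus inbound
   *  anchor text, with the page body deliberately left out. */
  public static String entryFor(String url, List<String> inboundAnchors) {
    StringBuilder text = new StringBuilder(url);
    for (String anchor : inboundAnchors) {
      text.append(' ').append(anchor);
    }
    return text.toString();
  }
}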
Andrzej Bialecki wrote:
Please be patient - I just got back from a trip, and I have a patch for
this, but I need to test it first before committing - another couple of
hours.
Mike just submitted a revision of the NutchFS API & implementation that
should fix this. Sorry, Andrzej, if this replica
Thank goodness...I thought I was going nuts on this.
Hopefully this patch will fix all of my weekend headaches.
Thanks!
Jason
- Original Message -
From: "Andrzej Bialecki" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, October 04, 2004 3:53 AM
Subject: Re: [Nutch-dev] merges
Jason Boss wrote:
Hi,
How do I go about troubleshooting this?
Please be patient - I just got back from a trip, and I have a patch for
this, but I need to test it first before committing - another couple of
hours.
--
Best regards,
Andrzej Bialecki
I'm an MSc student. My thesis is on retrieved-document clustering. I have a large collection and want to use free search code for retrieving documents. I would prefer a vector-space model with tf-idf weighting. Do you apply standard methods for search? Which protocol do you employ? Can I use yo
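(For what it's worth: the "standard method" usually meant here is the
vector-space model, where each term in a document gets a tf-idf weight and
queries are matched by vector similarity. Nutch's scoring is built on
Lucene, which uses a variant of this; the sketch below shows only the
classic weight, and the damping is illustrative rather than Lucene's exact
formula:)

public class TfIdf {
  /** Classic tf-idf term weight: sub-linear term frequency times a
   *  smoothed inverse document frequency over numDocs documents. */
  public static double weight(int termFreqInDoc, int docFreq, int numDocs) {
    double tf = Math.sqrt(termFreqInDoc);
    double idf = Math.log((double) numDocs / (docFreq + 1)) + 1.0;
    return tf * idf;
  }
}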
Interesting, really interesting. I didn't mean to argue about the
philosophy; I'm just stunned that they do index pages blocked by robots.txt.
I think it should not be the case. Maybe it's a mix of both our points
-- they index links to all pages (as I suggested, based on my gossiped
what's-behind-go
Dawid,
Thanks for your e-mail. I hadn't thought of it that way - it still
strikes me that one could make an argument that any content on open
pages is fair game, even if it is protected by robots.txt somewhere
else.
But instead of arguing philosophy, let me give you a Google example,
using a rath