In our experience, we use flavors of Nutch RPC, RMI and Externalizable.
RMI has been easy to implement when only one server needs to be accessed
(such as for a status check) and the class has many methods.
The Nutch RPC is excellent for distribution -- yes, one needs to serialize by
hand and create the
What do others think?
I think RMI isn't a good idea; I wasted a lot of time with it. I
like the Nutch RPC very much.
However, I think using Externalizable is a good idea: first, it is a
very small change.
Second, many users use Nutch for very custom things, and usage of
Externalizable mak
Stefan Groschupf wrote:
can someone please tell me what the technical difference is between
org.apache.nutch.io.Writable and java.io.Externalizable?
To me they look very similar, and Externalizable has been available since
JDK 1.1.
What am I missing?
You don't miss much!
I avoided using Java's bui
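For anyone comparing the two interfaces, here is a minimal, self-contained sketch of the Externalizable contract (the PageRecord class and its fields are hypothetical, not Nutch code): like Writable, the class hand-serializes its own fields, but it must expose a public no-arg constructor and it still goes through ObjectOutputStream, which adds stream headers and class descriptors that Writable avoids.

```java
import java.io.*;

// Hypothetical record type illustrating the Externalizable contract.
public class PageRecord implements Externalizable {
    private String url;
    private long fetchTime;

    public PageRecord() {}                     // required by Externalizable

    public PageRecord(String url, long fetchTime) {
        this.url = url;
        this.fetchTime = fetchTime;
    }

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeUTF(url);                     // hand-serialize each field,
        out.writeLong(fetchTime);              // much like Writable.write()
    }

    public void readExternal(ObjectInput in) throws IOException {
        url = in.readUTF();                    // read back in the same order
        fetchTime = in.readLong();
    }

    public static void main(String[] args) throws Exception {
        // Round-trip through Java object serialization.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(buf);
        oos.writeObject(new PageRecord("http://lucene.apache.org/nutch", 42L));
        oos.close();

        ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray()));
        PageRecord copy = (PageRecord) ois.readObject();
        System.out.println(copy.url + " " + copy.fetchTime);
    }
}
```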
Hi,
can someone please tell me what the technical difference is between
org.apache.nutch.io.Writable and java.io.Externalizable?
To me they look very similar, and Externalizable has been available since
JDK 1.1.
What am I missing?
Thanks for any hints.
Stefan
[EMAIL PROTECTED] wrote:
- http://www.nutch.org/docs/en/bot.html
+ http://lucene.apache.org/nutch/bot.html
I think this should now be:
http://lucene.apache.org/nutch/bot.html
The docs/en pages have mostly been reduced to the "about" page, whose
translations I hate to throw away, even thoug
+1
Piotr Kosiorowski wrote:
Hello,
We should probably change the user agent string in nutch-default.xml to
point to the Apache site. The only question is http.agent.version -- should
we set it to 0.07 for the release and 0.08-dev for future work? I do not
know how it was used previously.
Current values:
Hello,
We should probably change the user agent string in nutch-default.xml to
point to the Apache site. The only question is http.agent.version -- should
we set it to 0.07 for the release and 0.08-dev for future work? I do not
know how it was used previously.
Current values:
http.agent.url
http://ww
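For reference, properties in nutch-default.xml take the shape below; the values here are illustrative placeholders, not the shipped defaults:

```xml
<property>
  <name>http.agent.url</name>
  <value>http://lucene.apache.org/nutch/bot.html</value>
  <description>URL advertised in the User-Agent header.</description>
</property>

<property>
  <name>http.agent.version</name>
  <value>0.07</value>
  <description>Version string appended to the agent name.</description>
</property>
```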
#2 from your response:
> I'm not yet sure how disk
> failures appear to a JVM. Things are currently written so that if an
> exception is thrown during disk i/o then the datanode should take itself
> offline, initiating replication of its data. We'll see if that's
> sufficient.
the data is replicate
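The policy described above can be sketched as follows (class and method names are illustrative, not the actual NDFS source): any IOException during disk i/o flips the node offline, so the rest of the cluster re-replicates its blocks elsewhere.

```java
import java.io.IOException;

// Illustrative sketch of "take yourself offline on disk failure".
public class DataNodeSketch {
    private volatile boolean online = true;

    void writeBlock(byte[] block) {
        try {
            writeToDisk(block);
        } catch (IOException e) {
            // Disk trouble: stop serving rather than risk corrupt data;
            // going offline triggers re-replication of this node's blocks.
            online = false;
        }
    }

    // Stand-in for real disk i/o; always fails, to simulate a bad disk.
    void writeToDisk(byte[] block) throws IOException {
        throw new IOException("simulated disk failure");
    }

    public static void main(String[] args) {
        DataNodeSketch node = new DataNodeSketch();
        node.writeBlock(new byte[16]);
        System.out.println("online=" + node.online);
    }
}
```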
Jay Pound wrote:
1.) we need to split up chunks of data into sub-folders so as not to run the
filesystem up against its physical limit on the number of files in a single
directory, the way squid splits up its data into directories.
I agree. I am currently using reiser with NDFS so this is no
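A sketch of the sub-folder idea, in the spirit of squid's cache layout (the 16x16 bucket scheme and the names here are an illustrative choice, not what NDFS actually does): hash the file name into two directory levels so no single directory holds too many entries.

```java
// Illustrative two-level directory sharding by hash.
public class ShardedPath {
    static String pathFor(String name) {
        int h = name.hashCode();
        int d1 = (h >>> 4) & 0x0F;   // first-level bucket, 16 choices
        int d2 = h & 0x0F;           // second-level bucket, 16 choices
        return String.format("%02x/%02x/%s", d1, d2, name);
    }

    public static void main(String[] args) {
        // E.g. a block file lands in one of 256 evenly spread directories.
        System.out.println(pathFor("blk_123456"));
    }
}
```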
No problem for me. I just ran the test crawl on
http://lucene.apache.org/nutch as described in the new tutorial, and a lot
of PDF and PNG files were causing big exceptions and stack traces in the
log. I thought that people (usually using Nutch for the first time)
might think that they did something
[EMAIL PROTECTED] wrote:
Skipping png and pdf files.
I think the undocumented convention has been, by default, to still fetch
content types that are not parsed by default but which may be parsed by
simply enabling a plugin. That way folks need only change one place
in order to, e.g., st
Is there any way to filter results to English via search, so I can set up a
multi-language search? I thought I saw somewhere that you could put
something into the HTML of the form -- a switch, passed while submitting the
form, that would use a plugin to filter the results. I know I had seen some
benchmarks on a
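As a sketch of the post-search half of that idea (class and field names here are hypothetical, and it assumes a language code was attached to each document at index time, e.g. by a language-identification plugin):

```java
import java.util.*;
import java.util.stream.*;

// Hypothetical hit type carrying a language code set at index time.
public class LangFilter {
    record Hit(String url, String lang) {}

    // Keep only hits whose language code is English.
    static List<Hit> onlyEnglish(List<Hit> hits) {
        return hits.stream()
                   .filter(h -> "en".equals(h.lang()))
                   .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
            new Hit("http://example.com/a", "en"),
            new Hit("http://example.de/b", "de"));
        System.out.println(onlyEnglish(hits).size());
    }
}
```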
Here's a better way
http://([a-z0-9]*\.)+(com|org|net|biz|edu|mil|us|info|cc)/
FYI, this will not remove all non-English sites -- only international sites
that follow the two-letter country-code convention.
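A quick self-check of that filter (written with the duplicate `biz` removed and the stray dot before the TLD group cleaned up), confirming it accepts the listed generic TLDs and rejects two-letter country codes:

```java
import java.util.regex.Pattern;

// Sanity-check the TLD whitelist regex against two sample URLs.
public class TldFilterCheck {
    static final Pattern P = Pattern.compile(
        "http://([a-z0-9]*\\.)+(com|org|net|biz|edu|mil|us|info|cc)/");

    public static void main(String[] args) {
        // Listed generic TLD: kept.
        System.out.println(P.matcher("http://lucene.apache.org/").matches());
        // Two-letter country code: filtered out.
        System.out.println(P.matcher("http://example.de/").matches());
    }
}
```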
CC-
-Original Message-
From: Jay Pound [mailto:[EMAIL PROTECTED]
Sent: Monday, August 08, 2
I would like a confirmation from someone that this will work.
I've edited the regex filter in hopes of weeding out non-English sites from my
search results. I'll be testing pruning on my current 40mil index to see if
it works there -- or maybe there is a way to set the search to return only
English resu
Thanks. I will add it to Wiki (but not today).
P.
Doug Cutting wrote:
Piotr Kosiorowski wrote:
So I have installed Forrest and modified
src/site/src/documentation/content/xdocs.
Then I ran 'forrest', and it generated content in src/site/build/site.
And now the questions:
Should I copy src/site
+1
Piotr Kosiorowski wrote:
Hello,
Some time ago someone mentioned on the list a problem with the Nutch
tutorial (I cannot find that email now). I checked it today, and
he/she was right: if you follow the Nutch Intranet Crawling tutorial,
you will end up with a not very interesting index.
This is
Piotr Kosiorowski wrote:
So I have installed Forrest and modified
src/site/src/documentation/content/xdocs.
Then I ran 'forrest', and it generated content in src/site/build/site.
And now the questions:
Should I copy src/site/build/site to site and commit it?
Yes. I'm impressed that you got th
Thanks. It works.
Piotr
Doug Cutting wrote:
Piotr Kosiorowski wrote:
Looking around in JIRA I found out I cannot resolve an issue. I am
not sure how it works, but I suspect I lack some rights to do so. Am I
right?
I have added you to the nutch-developers Jira group. Now you should be
able
Piotr Kosiorowski wrote:
Looking around in JIRA I found out I cannot resolve an issue. I am not
sure how it works, but I suspect I lack some rights to do so. Am I right?
I have added you to the nutch-developers Jira group. Now you should be
able to resolve issues, etc.
Doug
Hello,
I just created an issue in JIRA,
http://issues.apache.org/jira/browse/NUTCH-79, containing the code for
fault-tolerant searching. I think it is too late to include it in the 0.7
release, but I would wait for comments and test it in the meantime.
I would like to commit it when the release would be d
A very basic facility seems to be missing in Nutch. If I have a list of
2000 URLs in the Nutch DB and want to ignore external links, I have to
build a regex filter with thousands of different domains I want to
crawl. There is no parameter to crawl only those domains and ignore
external links.
At these t
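The missing facility amounts to a same-host test on each outlink instead of a huge per-domain regex; a minimal sketch (method names are illustrative, not a Nutch API):

```java
import java.net.URL;

// Keep only outlinks whose host matches the page they were found on.
public class SameHostFilter {
    static boolean sameHost(String fromUrl, String outlink) throws Exception {
        return new URL(fromUrl).getHost()
                .equalsIgnoreCase(new URL(outlink).getHost());
    }

    public static void main(String[] args) throws Exception {
        // Internal link: kept.
        System.out.println(sameHost("http://example.com/a",
                                    "http://example.com/b"));
        // External link: dropped.
        System.out.println(sameHost("http://example.com/a",
                                    "http://other.org/c"));
    }
}
```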
I got it to work now; it wasn't selecting the directory I had chosen, so I
typed it in and it works fine.
BTW, very cool tool.
-J
- Original Message -
From: "Fredrik Andersson" <[EMAIL PROTECTED]>
To:
Sent: Sunday, August 07, 2005 6:16 PM
Subject: Re: luke??
> That's odd, Luke is working g
Piotr Kosiorowski wrote:
I can commit such changes for 0.7 release (it means today) if I got
positive feedback from other committers.
+1
--
Best regards,
Andrzej Bialecki <><
Hello,
Some time ago someone mentioned on the list a problem with the Nutch
tutorial (I cannot find that email now). I checked it today, and
he/she was right: if you follow the Nutch Intranet Crawling tutorial,
you will end up with a not very interesting index.
This is because it recommends users to
Hi,
my Searcher is now actually running on the index Nutch built.
Everything seems to work out,
so I am going on with the main part of my app.
Before Nutch I used Arachnid as a crawler.
During crawling I used my method:
/**
* Each page considered to be inserted in the sitemap graph is stored in