Jin,
Is it your intent to get the URL list only? If it is
just one website, you can crawl the website using
Nutch. Look at the "Intranet: Running the Crawl"
tutorial at
http://lucene.apache.org/nutch/tutorial8.html. Use a
very high number for depth, like 10. Once the crawl is
complete, you can ext
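For reference, the whole-site crawl described above can be run roughly like this (directory and file names here are just examples; the exact steps are in the tutorial linked above):

```shell
# Put the site's start URL(s) in a seed file, e.g. urls/seed.txt,
# then run a whole-site crawl with a high depth:
bin/nutch crawl urls -dir crawl -depth 10
```

Once the crawl finishes, the fetched URLs can be listed from the crawl db (in 0.8, something like `bin/nutch readdb` with a dump option; the exact flags depend on your version).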
Hi Jim,
I have attached nutch.bat per your email. Please make
sure that you set the env variable NUTCH_HOME
appropriately in the bat file.
I am sure that you are aware of the other env
variables NUTCH_LOG_DIR/NUTCH_LOGFILE and their use.
Thanks,
Bipin
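For anyone else following along, the relevant lines at the top of such a nutch.bat would look roughly like this (the paths are examples, not the contents of the actual attachment):

```bat
rem Adjust these to your installation (example paths):
set NUTCH_HOME=C:\nutch-0.8
set NUTCH_LOG_DIR=%NUTCH_HOME%\logs
set NUTCH_LOGFILE=nutch.log
```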
--- Jim Wilson <[EMAIL PROTECTED]> wrote:
Hi,
Please ignore my earlier question regarding the parse
command / HTMLParseFilter plugin. It was my mistake.
Plugins implementing HtmlParseFilter are called
during parse.
Thank you,
Bipin
--- Bipin Parmar <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have written a plugin i
Hi,
I have written a plugin implementing the
org.apache.nutch.parse.HtmlParseFilter extension
point. When I execute "fetch", it gets appropriately
called.
When I execute "fetch -noParsing", it does not get
called. I think this is how it is supposed to work.
However when I execute "parse", I tho
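For context, a minimal HtmlParseFilter implementation looks roughly like this. This is only a sketch against the 0.8-era API: the class name is made up, the configuration plumbing (setConf/getConf) is omitted, and the plugin must also be declared in its plugin.xml and enabled via plugin.includes.

```java
// Sketch only -- assumes the Nutch 0.8 HtmlParseFilter extension point.
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.protocol.Content;
import org.w3c.dom.DocumentFragment;

public class MyParseFilter implements HtmlParseFilter {
  public Parse filter(Content content, Parse parse,
                      HTMLMetaTags metaTags, DocumentFragment doc) {
    // Inspect or modify the parse here, e.g. add a metadata field,
    // then return the (possibly modified) Parse.
    return parse;
  }
}
```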
Hi,
I think that you should enable the "Language
Identification Parser/Filter" or write your own to
assign the language to each document in the index.
With this you will have just one index, with each
document having language="en" or language="de".
Depending on whether the user is accessing your
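Enabling that plugin usually means adding language-identifier to the plugin.includes property in conf/nutch-site.xml, e.g. (the rest of the plugin list below is just a typical default and may differ in your setup):

```xml
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|language-identifier</value>
</property>
```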
Kerry,
Cygwin is definitely a good option; however, if you do
not want to use Cygwin, I can send the nutch.bat file.
Could you please let me know the version of Nutch you
are using? There is a small difference in running
Nutch 0.7.2, 0.8-dev (the June version, which uses the
nutch.log.dir JVM param), and 0.8
they would migrate to nutch 0.8 production
release.
Thanks again to both of you.
Bipin Parmar
--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> TDLN wrote:
> > Unfortunately this is only feasible with *a lot*
> of custom code.
> > Probably you will be done sooner refetchi