Re: How to crawl every urls on a website?

2006-09-12 Thread Bipin Parmar
Jin, is it your intent to get the URL list only? If it is just one website, you can crawl the website using Nutch. Look at the "Intranet: Running the Crawl" tutorial at http://lucene.apache.org/nutch/tutorial8.html. Use a very high number for depth, like 10. Once the crawl is complete, you can ext

Re: Windows Native Launching?

2006-09-11 Thread Bipin Parmar
Hi Jim, I have attached nutch.bat per your email. Please make sure that you set the env variable NUTCH_HOME appropriately in the bat file. I am sure that you are aware of the other env variables NUTCH_LOG_DIR/NUTCH_LOGFILE and their use. Thanks, Bipin --- Jim Wilson <[EMAIL PROTECTED]> wrote:

Re: HTMLParseFilter is not called by ParseSegment (nutch parse command)

2006-08-09 Thread Bipin Parmar
Hi, Please ignore my earlier question regarding the parse command / HTMLParseFilter plugin. It was my mistake. The HTMLParseFilter implementing plugins are called during parse. Thank you, Bipin --- Bipin Parmar <[EMAIL PROTECTED]> wrote: > Hi, > > I have written a plugin i

HTMLParseFilter is not called by ParseSegment (nutch parse command)

2006-08-09 Thread Bipin Parmar
Hi, I have written a plugin implementing the org.apache.nutch.parse.HtmlParseFilter extension point. When I execute "fetch", it gets appropriately called. When I execute "fetch -noParsing", it does not get called. I think this is how it is supposed to work. However when I execute "parse", I tho

Re: Could we configure nutch-site.xml with two directories?

2006-07-18 Thread Bipin Parmar
Hi, I think that you should enable the "Language Identification Parser/Filter" or write your own to assign the language to each document in the index. With this you will have just one index with each document having language="en" or language="de". Depending on whether the user is accessing your

RE: Nutch on Windows

2006-07-14 Thread Bipin Parmar
Kerry, Cygwin is definitely a good option; however, if you do not want to use Cygwin, I can send the nutch.bat file. Could you please let me know the version of Nutch you are using? There is a small difference in running Nutch 0.7.2, 0.8-dev (June version which uses nutch.log.dir JVM param) and 0.8

Re: Migrating crawled data (urls) from version 0.7.1 to 0.8-dev.

2006-06-23 Thread Bipin Parmar
they would migrate to nutch 0.8 production release. Thanks again to both of you. Bipin Parmar --- Andrzej Bialecki <[EMAIL PROTECTED]> wrote: > TDLN wrote: > > Unfortunately this is only feasible with *a lot* > of custom code. > > Probably you will be done sooner refetchi