Applying patch NUTCH-573 ("multiple domains search") - exactly which Nutch version?

2008-01-16 Thread Arkadi.Kosmynin
Hi, can anyone please point me to a version of Nutch (sources) compatible with this patch? I've tried to apply it to 0.9.0, available on a mirror (http://apache.wildit.net.au/lucene/nutch/), but patching fails. Nor could I apply it to 0.8.1. Is anyone actively using this patch? Is it stable?

Announcing sixearch.org

2008-01-16 Thread Le-shin Wu
Some time back we announced the first public prototype of 6S, a peer application for social, distributed, adaptive Web search. Thanks to the feedback of our early testers, we have made many improvements and today we are launching v.0.3. We invite you to visit http://sixearch.org to download this la

Re: Help: parsing pdf files

2008-01-16 Thread Martin Kuen
Hi, what comes to mind is that there is a setting for the maximum size of a downloaded file. Have a look at "nutch-default.xml" and override it in "nutch-site.xml". PDF files tend to be quite big (compared to HTML), so this is probably the source of your problem. PDF files are downloaded and ma

Re: Issues with plugin development

2008-01-16 Thread Jake
Viksit, if you're doing a crawl on a single machine, check [directory_you_have_nutch_in]/logs/hadoop.log for what came out in the crawl. Using Tomcat, I normally find the log output for searches in catalina.out. Hope that helps, Jake. On Jan 15, 2008, at 10:47 PM, Viksit Gaur wrote: Hi

Re: How to use Nutch to parse Web-pages!

2008-01-16 Thread Tomislav Poljak
Hi, I think the simplest way to get parsed text from a segment (Nutch stores parse text in the segment, for example: crawl/segments/20080107120936/parse_text) into a text file is the dump option of the segment reader: bin/nutch readseg -dump crawl/segments/20080107120936 dump -nocontent -nofetch -nogenerate -nopar

Re: Customize Crawling..

2008-01-16 Thread Manoj Bist
I came across a languageidentifier plugin at PluginCentral while trying to figure out something else. *Maybe* this could be a starting point for you. http://wiki.apache.org/nutch/PluginCentral 2008/1/16 Volkan Ebil <[EMAIL PROTECTED]>: > url filter will solve the url limitation problem thanks. Is

RE: Customize Crawling..

2008-01-16 Thread Volkan Ebil
A URL filter will solve the URL limitation problem, thanks. Does anyone know how I can add an if check to the crawl process that allows only sites containing special characters like "ç,ü,ğ"? Should I study the parse algorithm?
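
The kind of check asked about here can be pictured with a minimal Python sketch. It is not Nutch code and does not use any Nutch plugin API; the function names and the `pages` mapping of URL to parsed text are hypothetical, chosen only to illustrate testing parsed text for the characters named in the question.

```python
import unicodedata

# Assumption: these are the marker characters from the question ("ç,ü,ğ").
MARKER_CHARS = frozenset("çüğ")


def contains_marker_chars(text, markers=MARKER_CHARS):
    """Return True if the text contains at least one marker character.

    The text is NFC-normalized so that decomposed forms (for example
    "u" followed by a combining diaeresis) match, and lowercased so
    that "Ç", "Ü" and "Ğ" are also detected.
    """
    normalized = unicodedata.normalize("NFC", text).lower()
    return any(ch in markers for ch in normalized)


def filter_pages(pages, markers=MARKER_CHARS):
    """Return the URLs whose parsed text contains a marker character.

    `pages` is a hypothetical dict mapping URL -> parsed page text.
    """
    return [url for url, text in pages.items()
            if contains_marker_chars(text, markers)]
```

A check like this only says that a page contains those characters, not that the page is in a particular language, which may be why a language identifier was suggested in the reply as a starting point.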