Hi,
Can anyone please point me to a version of Nutch (sources) compatible
with this patch? I've tried to apply it to the 0.9.0 release available on a
mirror (http://apache.wildit.net.au/lucene/nutch/), but patching fails; I could
not apply it to 0.8.1 either.
Is anyone actively using this patch? Is it stable?
Some time back, we announced the first public prototype of 6S, a peer
application for social, distributed, adaptive Web search. Thanks to the
feedback of our early testers, we have made many improvements, and today we
are launching v0.3. We invite you to visit http://sixearch.org to download
this la
Hi,
what comes to mind is that there is a setting for the maximum size of a
downloaded file.
Have a look at "nutch-default.xml" and override the setting in "nutch-site.xml".
PDF files tend to be quite big (compared to HTML), so this is probably the
source of your problem.
pdf files are downloaded and ma
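For reference, the property that caps the downloaded size is, as far as I recall, http.content.limit (check your version's nutch-default.xml to confirm the name and default). A minimal override in nutch-site.xml might look like this, with the 10 MB value being just an example:

```xml
<!-- nutch-site.xml override: raise the per-document download cap so that
     large PDFs are not truncated. A value of -1 disables the limit. -->
<property>
  <name>http.content.limit</name>
  <value>10485760</value> <!-- example: 10 MB -->
</property>
```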
Viksit,
if you're doing a crawl on a single machine, check
[directory_you_have_nutch_in]/logs/hadoop.log for the crawl output. When using
Tomcat, I normally find the log output for searches in catalina.out.
Hope that helps,
Jake.
On Jan 15, 2008, at 10:47 PM, Viksit Gaur wrote:
Hi
Hi,
I think the simplest way to get parsed text from a segment (Nutch stores the
parse text inside the segment, for example
crawl/segments/20080107120936/parse_text) into a text file is the dump option
of the segment reader:
bin/nutch readseg -dump crawl/segments/20080107120936 dump -nocontent
-nofetch -nogenerate -nopar
I came across a languageidentifier plugin at PluginCentral while trying to
figure out something else. Maybe this could be a starting point for you.
http://wiki.apache.org/nutch/PluginCentral
2008/1/16 Volkan Ebil <[EMAIL PROTECTED]>:
> The URL filter will solve the URL limitation problem, thanks. Does
The URL filter will solve the URL limitation problem, thanks. Does anyone know
how I can add a check to the crawl process that allows only the sites that
contain special characters like "ç, ü, ğ"? Should I study the parse algorithm?