Thanks Lewis!!
-----Original Message-----
From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com>
Sent: Monday, 18 August 2014 22:36
To: user@nutch.apache.org; d...@nutch.apache.org
Subject: [RELEASE] Apache Nutch 1.9
Hi Everyone,
The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch 1.9.
Hi all,
I want to crawl a JSON file from a URL.
When I fetch it with wget, the resulting file has wrongly encoded Chinese
characters. If I then run iconv -f gbk -t utf-8 file.json, I get the
correct result.
Then I use Nutch and run readseg -dump to get the result. The
ParseText
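For anyone hitting the same symptom, the manual transcoding step described above can be reproduced end to end with a small sample (a sketch; the file names are illustrative, and the sample bytes stand in for the downloaded JSON):

```shell
# Illustrative sample: the bytes D6 D0 CE C4 are "中文" ("Chinese")
# encoded as GBK (written as octal escapes for printf portability).
printf '\326\320\316\304\n' > file.json

# The transcoding step from the report above: GBK bytes -> UTF-8.
iconv -f gbk -t utf-8 file.json > file.utf8.json

# The UTF-8 copy is now readable in a UTF-8 terminal.
cat file.utf8.json
```

If iconv produces the correct text, the source server is serving GBK bytes (possibly with a missing or wrong charset header), which is a useful clue when checking how Nutch's parser detected the encoding.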
Hi S.L.,
1. Nutch follows a site's robots.txt file by default; maybe you can take
a look at the robots rules for the missing domains by going to
http://example.com/robots.txt.
2. Also, some URL filters will be applied; maybe you can paste the
output after you inject the seed.txt.
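The two checks suggested above can be done from the shell (a sketch; the robots.txt content here is a made-up sample standing in for a real fetch, and the conf path assumes the default Nutch layout):

```shell
# Sample robots.txt (illustrative; fetch the real one with e.g.
# `wget http://example.com/robots.txt` for each missing domain).
cat > robots.txt <<'EOF'
User-agent: *
Disallow: /private/
EOF

# 1. Look for Disallow rules that could exclude the missing URLs.
grep -i disallow robots.txt

# 2. Review the active URL filter patterns; a pattern prefixed with
#    "-" rejects matching URLs (uncomment on a Nutch install):
# grep -v '^#' conf/regex-urlfilter.txt
```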
Thanks. The problem is that if I reduce the URLs in the seed list to any 5,
all of them are crawled, which tells me it's not a URL-filtering
issue; it seems Nutch is not able to crawl more than 5 domains from
the seed list. Is there a property that I am setting by mistake that's
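One thing worth ruling out (a guess, not a diagnosis): the generate step can cap how many URLs are selected per cycle, so a small -topN on the generate command or a restrictive generate.max.count in nutch-site.xml would produce exactly this "only N URLs get crawled" symptom. A quick way to see what is in effect (the config snippet below is hypothetical):

```shell
# Illustrative nutch-site.xml override (the value 5 is hypothetical;
# on a real install, inspect conf/nutch-site.xml instead).
cat > nutch-site.xml <<'EOF'
<configuration>
  <property>
    <name>generate.max.count</name>
    <value>5</value>
  </property>
</configuration>
EOF

# A low generate.max.count (or a small -topN passed to
# `bin/nutch generate`) caps how many URLs each cycle selects.
grep -A1 'generate.max.count' nutch-site.xml
```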
Hear hear, great job dudes
++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email:
Hi All,
I have a consistent failure in one of the applications after my Nutch job
has run for about an hour. Can someone please suggest why this error is
occurring, based on the exception message below?
Diagnostics:
Application application_1408512952691_0017 failed 2 times due to AM
Container for
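When a YARN ApplicationMaster container fails like this, the root cause is usually in the container logs rather than in the diagnostics summary. A sketch of how to pull them (the application id is copied from the diagnostics above; this needs the Hadoop client on the PATH and access to the cluster, so it is guarded to be a no-op elsewhere):

```shell
# Pull the full container logs for the failed application; the AM is
# retried, so the first stack trace in the log is the one to read.
if command -v yarn >/dev/null 2>&1; then
  yarn logs -applicationId application_1408512952691_0017
else
  echo "yarn CLI not found; run this on a cluster node"
fi
```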