RE: [RELEASE] Apache Nutch 1.9

2014-08-20 Thread Markus Jelsma
Thanks Lewis!! -----Original message----- From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> Sent: Monday 18th August 2014 22:36 To: user@nutch.apache.org; d...@nutch.apache.org Subject: [RELEASE] Apache Nutch 1.9 Hi Everyone, The Apache Nutch PMC are pleased to announce the immediate release

Nutch 1.7 content encoding problem

2014-08-20 Thread adu
Hi all, I want to crawl a JSON file from a URL. I ran wget on the URL and found that the resulting file has wrongly encoded Chinese characters. Then I ran iconv -f gbk -t utf-8 file.json and got the correct result. Then I used Nutch, with the readseg dump to get the result. The ParseText
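
The iconv step described above can be reproduced in a few lines of Python, which may help when checking whether fetched bytes really are GBK. This is only a sketch: the helper name gbk_to_utf8 and the sample string are made up for illustration, and it says nothing about how Nutch itself detects or converts encodings.

```python
def gbk_to_utf8(raw: bytes) -> bytes:
    """Re-encode GBK bytes as UTF-8, similar to `iconv -f gbk -t utf-8`.

    Hypothetical helper for illustration; raises UnicodeDecodeError
    if the input is not valid GBK.
    """
    return raw.decode("gbk").encode("utf-8")


if __name__ == "__main__":
    # Assumed sample: two Chinese characters as a GBK server would send them.
    sample = "中文".encode("gbk")
    converted = gbk_to_utf8(sample)
    # The converted bytes now decode cleanly as UTF-8.
    print(converted.decode("utf-8"))
```

If the decode raises UnicodeDecodeError, the source is probably not GBK after all, and the charset declared in the HTTP response headers would be the next thing to check.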

Re: Nutch not crawling all the domains in the seed list.

2014-08-20 Thread Bin Wang
Hi S.L., 1. Nutch obeys a site's robots.txt file by default, so maybe you can take a look at the robots rules for the missing domains by going to http://example.com/robots.txt? 2. Also, there are some URL filters that will be applied, so maybe you can paste the output after you inject the seed.txt
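
One way to do the robots.txt check from point 1 without opening each file by hand is Python's standard urllib.robotparser. This is a rough sketch, not Nutch's own robots handling: the helper is_allowed, the agent name MyNutchCrawler and the sample rules are assumptions for illustration, and Nutch may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    Hypothetical helper; uses the Python standard library parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Assumed sample rules; in practice, paste the real robots.txt content here.
    rules = "User-agent: *\nDisallow: /private/\n"
    for url in ("http://example.com/private/page.html",
                "http://example.com/index.html"):
        print(url, is_allowed(rules, "MyNutchCrawler", url))
```

Use the agent name your crawler actually sends (the http.agent.name value in the Nutch configuration), since rules can differ per user agent.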

Re: Nutch not crawling all the domains in the seed list.

2014-08-20 Thread S.L
Thanks, the problem is that if I reduce the URLs in the seed list to any 5, all of them are crawled, which tells me it's not a URL filtering issue. It just seems Nutch is not able to crawl more than 5 domains from the seed list. Is there a property that I am setting by mistake that's

Re: [RELEASE] Apache Nutch 1.9

2014-08-20 Thread Mattmann, Chris A (3980)
Hear, hear, great job dudes ++ Chris Mattmann, Ph.D. Chief Architect Instrument Software and Science Data Systems Section (398) NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 168-519, Mailstop: 168-527 Email:

Nutch 1.7 failing on Hadoop YARN after running for a while.

2014-08-20 Thread S.L
Hi All, I have a consistent failure in one of the applications after my Nutch job runs for about an hour. Can someone please suggest why this error is occurring, looking at the exception message below? Diagnostics: Application application_1408512952691_0017 failed 2 times due to AM Container for