----- Original Message -----
From: "Ratnesh,V2Solutions India" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, April 13, 2007 12:12 AM
Subject: Re: How to config nutch just crawl html links?
OK.
I don't see parse-js in that value, so include it there as well. The
problem probably arises because the crawl reaches JavaScript files, whose
content type is contentType=application/x-javascript. At parse time Nutch
finds these JavaScript pages but does not recognize any parser for them.
So I recommend looking at parse-plugins.xml and checking the MIME type
mapping and the alias name of the parse-js plugin, something like this:
<mimeType name="application/x-javascript">
<plugin id="parse-js" />
<plugin id="parse-text" />
</mimeType>
and
<alias name="parse-js" extension-id="JSParser" />
I think checking this, and including parse-js in the plugin.includes value
in nutch-default.xml and nutch-site.xml, will solve the problem.
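
As a minimal sketch of that suggestion, the plugin.includes override could
look like the following. It reuses the value Meryl quotes further down this
thread and only widens the parse group from parse-(html) to parse-(html|js).
Placing it in conf/nutch-site.xml under a <configuration> root is an
assumption about a typical Nutch setup, so check it against your own files.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml (assumed location): overrides nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- same list as quoted in this thread, with parse-js added to the parse group -->
    <value>protocol-http|urlfilter-regex|parse-(html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

With parse-js enabled this way, the application/x-javascript mapping in
parse-plugins.xml should have a plugin it can resolve to.
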
"Ratnesh,V2Solutions, India"
Meryl Silverburgh wrote:
Thanks. I changed my nutch-default.xml to the following:
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
But I still see this error message. I didn't expect it to try to fetch
js files at all.
Error parsing: http://www.cnn.com/exchange/submit/pokkariJavascript.js:
failed(2,200): org.apache.nutch.parse.ParseException: parser not found
for contentType=application/x-javascript
url=http://www.cnn.com/exchange/submit/pokkariJavascript.js
And why does it fetch the RSS file too?
fetching http://rss.cnn.com/rss/cnn_ireports.rss
Any help is appreciated.
On 4/12/07, Ratnesh,V2Solutions India <[EMAIL PROTECTED]> wrote:
Hi,
What you can do is remove parse-js and the other related plugins from both
the nutch-site.xml file and the nutch-default.xml file.
Changing nutch-default.xml is not recommended, though sometimes a change
does not take effect unless it is made in nutch-default.xml as well.
So see which changes fit your requirements. I am sure that once you remove
parse-js it won't crawl JavaScript. Also try removing other plugins such as
parse-msword.
I hope that does it.
Ratnesh,V2Solutions,India
Meryl Silverburgh wrote:
>
> Hi,
>
> How can I configure Nutch to crawl only HTML links (no images, no
> JavaScript files, no CSS files)?
> It also should not record links to non-HTML pages in the crawl database.
>
> thank you.
>
>
--
View this message in context:
http://www.nabble.com/How-to-config-nutch-just-crawl-html-links--tf3562947.html#a9957697
Sent from the Nutch - User mailing list archive at Nabble.com.