Re: [Nutch-general] How to config nutch just crawl html links?

jim shirreffs Fri, 13 Apr 2007 05:54:00 -0700

----- Original Message ----- 
From: "Ratnesh,V2Solutions India" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, April 13, 2007 12:12 AM
Subject: Re: How to config nutch just crawl html links?



>
> OK.
> I don't see parse-js in the value; include it there as well. The problem
> probably arises because you are trying to parse JavaScript pages, whose
> content type is application/x-javascript: at parse time it finds pages
> containing JavaScript but does not recognize any parser for them.
>
> So I recommend looking at parse-plugins.xml and checking the mime type
> and alias name of the parse-js plugin, something like this:
>
> <mimeType name="application/x-javascript">
> <plugin id="parse-js" />
> <plugin id="parse-text" />
> </mimeType>
>
> and
>
> <alias name="parse-js" extension-id="JSParser" />
>
> I think checking this, and including parse-js in both nutch-default.xml
> and nutch-site.xml, will solve the problem.
>
> "Ratnesh,V2Solutions, India"
>
>
>
>
>
> Meryl Silverburgh wrote:
>>
>> Thanks. I changed my nutch-default.xml to the following:
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>>
>> But I still see this error message. I didn't expect it to try to fetch
>> .js files at all.
>>
>> Error parsing: http://www.cnn.com/exchange/submit/pokkariJavascript.js:
>> failed(2,200): org.apache.nutch.parse.ParseException: parser not found
>> for contentType=application/x-javascript
>> url=http://www.cnn.com/exchange/submit/pokkariJavascript.js
>>
>>
>> And why does it fetch the RSS file too?
>>
>> fetching http://rss.cnn.com/rss/cnn_ireports.rss
>>
>>
>>
>> Any help is appreciated.
>>
>>
>>
>> On 4/12/07, Ratnesh,V2Solutions India
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi,
>>> What you can do is remove parse-js and the other related plugins from
>>> both the nutch-site.xml and nutch-default.xml files.
>>> Changing nutch-default.xml is not recommended, though sometimes a
>>> change has no effect unless nutch-default.xml is changed as well.
>>>
>>> So see which changes you can make according to your requirements. I am
>>> sure that once you remove parse-js it won't crawl JavaScript. Also try
>>> removing other plugins such as parse-msword.
>>>
>>> I hope that does it.
>>>
>>> Ratnesh,V2Solutions,India
>>>
>>>
>>>
>>> Meryl Silverburgh wrote:
>>> >
>>> > Hi,
>>> >
>>> > How can I configure Nutch to crawl only HTML links (no images, no
>>> > JavaScript files, no CSS files)?
>>> > I also don't want links to non-HTML pages recorded in the crawl database.
>>> >
>>> > thank you.
>>> >
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/How-to-config-nutch-just-crawl-html-links--tf3562947.html#a9957697
>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>
>>>
>>
>>
>
> 
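For what it's worth, here is a rough sketch of why the "parser not found" error shows up with the plugin.includes value quoted above. It is plain Python, not Nutch code. It assumes that plugin.includes is a regular expression matched against the whole plugin id, and the helper name is_included is made up for illustration. Under that assumption, parse-(html) covers parse-html only, so no parser is loaded for application/x-javascript. Note that this only concerns which plugins are loaded; which URLs get fetched is decided by the URL filters (urlfilter-regex in that value), not by the parse plugins.

```python
import re

# plugin.includes value copied from the modified nutch-default.xml quoted above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html)|index-basic|"
    "query-(basic|site|url)|summary-basic|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)"
)


def is_included(plugin_id, pattern=PLUGIN_INCLUDES):
    """Return True if the whole plugin id matches the include pattern.

    Assumption: the id must match the pattern in full, not just a prefix.
    """
    return re.fullmatch(pattern, plugin_id) is not None


if __name__ == "__main__":
    # Plugin ids taken from the thread; prints each id with True or False.
    for plugin_id in ("parse-html", "parse-js", "parse-text", "urlfilter-regex"):
        print(plugin_id, is_included(plugin_id))
```

Changing parse-(html) to parse-(html|js) in the pattern would make parse-js match as well, which is what the suggestion to include parse-js amounts to.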


_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
