----- Original Message -----
From: "Ratnesh,V2Solutions India" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, April 13, 2007 12:12 AM
Subject: Re: How to config nutch just crawl html links?
>
> O.K.
> I don't see parse-js in the value; include it there as well. The problem
> probably arises because you are trying to parse JavaScript pages whose
> content type is contentType=application/x-javascript: at parse time Nutch
> finds the JavaScript pages but does not recognize a parser for them.
>
> So I recommend that you look at parse-plugins.xml and check the mime type
> and alias name of the parse-js plugin, something like this:
>
> <mimeType name="application/x-javascript">
>   <plugin id="parse-js" />
>   <plugin id="parse-text" />
> </mimeType>
>
> and
>
> <alias name="parse-js" extension-id="JSParser" />
>
> I think checking this, and including parse-js in nutch-default.xml
> and nutch-site.xml, will solve the problem.
>
> "Ratnesh,V2Solutions, India"
>
>
> Meryl Silverburgh wrote:
>>
>> Thanks. I changed my nutch-default.xml to the following:
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|urlfilter-regex|parse-(html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> </property>
>>
>> But I still see this error message. I didn't expect it to try to fetch
>> js files at all.
>>
>> Error parsing: http://www.cnn.com/exchange/submit/pokkariJavascript.js:
>> failed(2,200): org.apache.nutch.parse.ParseException: parser not found
>> for contentType=application/x-javascript
>> url=http://www.cnn.com/exchange/submit/pokkariJavascript.js
>>
>> And why does it fetch the rss file too?
>>
>> fetching http://rss.cnn.com/rss/cnn_ireports.rss
>>
>> Any help is appreciated.
>>
>> On 4/12/07, Ratnesh,V2Solutions India
>> <[EMAIL PROTECTED]> wrote:
>>>
>>> Hi,
>>> What you can do is remove parse-js and other related plugins from
>>> both the nutch-site.xml and nutch-default.xml files.
>>> Changing nutch-default.xml is not recommended, though sometimes a
>>> change has no effect unless it is made in nutch-default.xml.
>>>
>>> So see what changes you can make for your requirements. I am sure
>>> that once you remove parse-js it won't crawl JavaScript; also try
>>> removing other plugins such as parse-msword etc.
>>>
>>> I hope that does it.
>>>
>>> Ratnesh,V2Solutions,India
>>>
>>>
>>> Meryl Silverburgh wrote:
>>> >
>>> > Hi,
>>> >
>>> > How can I configure nutch to crawl just html links (no images, no
>>> > javascript files, no css files)?
>>> > And so that it won't record links to non-html pages in the crawl
>>> > database?
>>> >
>>> > Thank you.
>>> >
>>>
>>> --
>>> View this message in context:
>>> http://www.nabble.com/How-to-config-nutch-just-crawl-html-links--tf3562947.html#a9957697
>>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>>
>>
>
> --
> View this message in context:
> http://www.nabble.com/How-to-config-nutch-just-crawl-html-links--tf3562947.html#a9972986
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
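
For illustration only, here is a sketch of what the reply's first suggestion (adding parse-js to the plugin.includes value) might look like. It starts from the value Meryl quoted in the thread, and the only change is that parse-(html) becomes parse-(html|js). It assumes the parse-js plugin is available in the installation and that the override is placed in nutch-site.xml, which is an assumption, since the thread edits nutch-default.xml directly.

```xml
<!-- Sketch only: the plugin.includes value from the thread, with parse-js added
     as the reply suggests. Placement in nutch-site.xml is an assumption. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```

This is aimed at the "parser not found" error only. It would not by itself keep .js or .rss URLs from being fetched, which is what the original question asked for.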
