Re: Nutch ScoringFilter plugin problems
Hello,

I still have the same problem. I have the following piece of code:

    if (linkdb == null) {
        System.out.println("Null linkdb");
    } else {
        System.out.println("LinkDB not null");
    }
    Inlinks inlinks = linkdb.getInlinks(url);
    System.out.println("a");

The output always shows "LinkDB not null", so linkdb is not null. But "a" never gets printed, so I guess that some error happens at:

    Inlinks inlinks = linkdb.getInlinks(url);

Maybe the getInlinks function throws an IOException? I do catch the IOException, but the catch block is never executed either.

One question: how should I create the LinkDbReader? I do it the following way:

    linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb"));

Is that right?

Thanks.

On Wed, Jan 21, 2009 at 10:16 AM, Pau pau...@gmail.com wrote:
> Ok, I think you are right, maybe inlinks is null. I will try it now. Thank you!
>
> I have no information about the exception. It seems that the program simply skips this part of the code... maybe a ScoringFilterException is thrown?
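If "a" is never printed and the catch block for IOException never runs, one possibility (an assumption, not something confirmed in this thread) is that the call throws an unchecked exception or an error, which a catch clause for IOException does not intercept. The following is a minimal, self-contained sketch of how such a failure could be made visible. ThrowableProbe and the lookup lambda are hypothetical stand-ins for the linkdb.getInlinks(url) call; they are not Nutch code.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Nutch: runs a call and reports whatever it throws.
class ThrowableProbe {

    // Returns a short description of the outcome instead of letting the exception escape.
    static String probe(Callable<?> call) {
        try {
            Object result = call.call();
            return "returned: " + result;
        } catch (IOException e) {
            // The case the code in the message already handles.
            return "IOException: " + e.getMessage();
        } catch (Throwable t) {
            // Unchecked exceptions and errors (e.g. NullPointerException) end up here.
            return t.getClass().getName() + ": " + t.getMessage();
        }
    }

    public static void main(String[] args) {
        // Stand-in for linkdb.getInlinks(url): a lookup that fails with an unchecked exception.
        Callable<Object> lookup = () -> {
            throw new IllegalStateException("reader not initialised");
        };
        System.out.println(probe(lookup));
    }
}
```

Inside the plugin, the same idea would amount to temporarily adding a broad catch around the suspicious call and printing the stack trace, then removing it once the real cause is known.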
Re: Nutch ScoringFilter plugin problems
Ok, I think you are right, maybe inlinks is null. I will try it now. Thank you!

I have no information about the exception. It seems that the program simply skips this part of the code... maybe a ScoringFilterException is thrown?

On Wed, Jan 21, 2009 at 9:47 AM, Doğacan Güney doga...@gmail.com wrote:
> On Tue, Jan 20, 2009 at 7:18 PM, Pau pau...@gmail.com wrote:
> > Hello,
> >
> > I want to create a new ScoringFilter plugin. In order to evaluate how interesting a web page is, I need information about the link structure in the LinkDB. In the method updateDbScore, I have the following lines (among others):
> >
> >     88   linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb"));
> >     ...
> >     99   System.out.println("Inlinks to " + url);
> >     100  Inlinks inlinks = linkdb.getInlinks(url);
> >     101  System.out.println("a");
> >     102  Iterator<Inlink> iIt = inlinks.iterator();
> >     103  System.out.println("b");
> >
> > "a" always gets printed, but "b" rarely gets printed, so it seems that an error happens at line 102 and an exception is raised. Do you know why this is happening? What am I doing wrong?
> >
> > Thanks.
>
> Maybe there are no inlinks to that page so inlinks is null?
> What is the exception exactly?
>
> --
> Doğacan Güney
Nutch ScoringFilter plugin problems
Hello,

I want to create a new ScoringFilter plugin. In order to evaluate how interesting a web page is, I need information about the link structure in the LinkDB. In the method updateDbScore, I have the following lines (among others):

    88   linkdb = new LinkDbReader(getConf(), new Path("crawl/linkdb"));
    ...
    99   System.out.println("Inlinks to " + url);
    100  Inlinks inlinks = linkdb.getInlinks(url);
    101  System.out.println("a");
    102  Iterator<Inlink> iIt = inlinks.iterator();
    103  System.out.println("b");

"a" always gets printed, but "b" rarely gets printed, so it seems that an error happens at line 102 and an exception is raised. Do you know why this is happening? What am I doing wrong?

Thanks.
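The replies in this thread suggest that inlinks may be null for pages that have no inlinks, in which case calling iterator() on it at line 102 would throw a NullPointerException. A minimal sketch of a null guard follows. It uses a plain Map as a hypothetical stand-in for the link database (it is not the Nutch LinkDbReader API), and the example URLs are made up.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a link database; not the Nutch LinkDbReader API.
class InlinkGuard {

    // Counts the inlinks recorded for a URL, treating a missing entry (null) as zero.
    static int countInlinks(Map<String, List<String>> linkdb, String url) {
        List<String> inlinks = linkdb.get(url); // null when the URL has no entry
        if (inlinks == null) {
            return 0; // calling inlinks.iterator() here would throw NullPointerException
        }
        int count = 0;
        for (Iterator<String> it = inlinks.iterator(); it.hasNext(); it.next()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, List<String>> linkdb = new HashMap<>();
        linkdb.put("http://example.com/a",
                Arrays.asList("http://example.com/b", "http://example.com/c"));
        System.out.println(countInlinks(linkdb, "http://example.com/a"));
        System.out.println(countInlinks(linkdb, "http://example.com/unknown"));
    }
}
```

In the plugin itself, the equivalent would be to test the value returned by getInlinks for null before asking it for an iterator.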
Troubles while creating a plugin
Hello,

I am creating a plugin for Nutch that extends the QueryFilter. I get a successful compilation with ant and ant war, but when I do a search, I get the following exception:

    26/11/2008 18:50:07 org.apache.catalina.core.StandardWrapperValve invoke
    SEVERE: Servlet.service() for servlet jsp threw exception
    java.lang.NoClassDefFoundError: org/apache/commons/codec/DecoderException
        at org.apache.tika.mime.MimeTypesReader.readMatch(MimeTypesReader.java:272)
        at org.apache.tika.mime.MimeTypesReader.readMatches(MimeTypesReader.java:221)
        at org.apache.tika.mime.MimeTypesReader.readMagic(MimeTypesReader.java:201)
        at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:164)
        at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:138)
        at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:121)
        at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:56)
        at org.apache.nutch.util.MimeUtil.<init>(MimeUtil.java:62)
        at org.apache.nutch.protocol.Content.<init>(Content.java:85)
        at org.apache.nutch.personalizedsearch.searcher.context.ContextQueryFilter.filter(ContextQueryFilter.java:55)
        at org.apache.nutch.searcher.QueryFilters.filter(QueryFilters.java:111)
        at org.apache.nutch.searcher.IndexSearcher.search(IndexSearcher.java:96)
        at org.apache.nutch.searcher.NutchBean.search(NutchBean.java:251)
        at org.apache.jsp.search_jsp._jspService(search_jsp.java:284)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:393)
        at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
        at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:175)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:128)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:263)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:844)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:584)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:619)

The DecoderException class is in commons-codec-1.3.jar, so I added the jar file to my plugin.xml:

    <runtime>
       <!-- As defined in build.xml this plugin will end up bundled as recommended.jar -->
       <library name="personalized-search.jar">
          <export name="*"/>
       </library>
       <library name="commons-codec-1.3.jar"/>
    </runtime>

But the same error appears. Any idea what I may be doing wrong?

Thanks.
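In the trace above, the missing class is requested by Tika's MimeTypesReader, so what matters is whether the class loader that loaded Tika can see commons-codec, not only whether the plugin can. A quick way to narrow this down is to test whether a given class loader can locate the class at all. The following is a minimal sketch using only the standard Class.forName call; ClassProbe is a hypothetical helper, and which class loader Nutch or Tomcat uses for each jar is not established in this thread.

```java
// Hypothetical diagnostic helper, not part of Nutch or Tomcat.
class ClassProbe {

    // Returns true if the given loader (or one of its parents) can load the named class.
    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            // initialize=false: only locate the class, do not run static initializers
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        String name = "org.apache.commons.codec.DecoderException";
        System.out.println(name + " loadable: " + isLoadable(name, loader));
    }
}
```

Calling isLoadable from inside the query filter with different loaders (for example the plugin class's own loader and the thread context loader) would show which of them, if any, can see the jar.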
Retrieving text content from html files
Hello,

I am developing a Nutch plugin that needs to read the text content of some URLs. I think the parse-html plugin contains the necessary code to do so, but I don't know which methods to use or how to use them. What should I do?

Thanks.
Re: Writing a plugin
Hi,

I have added my plugin (called recommended) to nutch-site.xml, but it seems that Nutch is not using it. I say this because when I search for "recom" I get no results, even though there is a page that has the meta tag:

    <meta name="recommended" content="recom"/>

I have attached my nutch-site.xml and nutch-default.xml files; maybe you can see something wrong.

Apart from that, my plugin compiles OK, but when I run ant test I get errors. I have also attached the output of ant test.

On Sun, May 11, 2008 at 8:08 PM, [EMAIL PROTECTED] wrote:
> Hi,
>
> Yes, you have to add your plugin to nutch-site.xml, along with other plugins you probably already have defined there. If you don't have them in nutch-site.xml, look at nutch-default.xml.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message -----
> From: Pau [EMAIL PROTECTED]
> To: nutch-dev@lucene.apache.org
> Sent: Sunday, May 11, 2008 8:28:53 AM
> Subject: Writing a plugin
>
> Hello,
>
> I am following the WritingPluginExample-0.9 and I am a bit confused about how to get Nutch to use my plugin. In the section called "Getting Ant to Compile Your Plugin" it says: "The next time you run a crawl your parser and index filter should get used." But at the end of the document, there is another section called "Getting Nutch to Use Your Plugin". Do I have to edit the nutch-site.xml file as "Getting Nutch to Use Your Plugin" says? Or is it not necessary?
>
> Thank you.

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->

    <configuration>

    <property>
      <name>http.agent.name</name>
      <value>PauSpider</value>
      <description>HTTP 'User-Agent' request header. MUST NOT be empty -
      please set this to a single word uniquely related to your organization.

      NOTE: You should also check other related properties:

        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version

      and set their values appropriately.
      </description>
    </property>

    <property>
      <name>http.agent.description</name>
      <value>Nutch Crawler</value>
      <description>Further description of our bot- this text is used in
      the User-Agent header. It appears in parenthesis after the agent name.
      </description>
    </property>

    <property>
      <name>http.agent.email</name>
      <value>[EMAIL PROTECTED]</value>
      <description>Description</description>
    </property>

    <property>
      <name>plugin.includes</name>
      <value>recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
      <description>Regular expression naming plugin id names to
      include. Any plugin not matching this expression is excluded.
      In any case you need at least include the nutch-extensionpoints plugin. By
      default Nutch includes crawling just HTML and plain text via HTTP,
      and basic indexing and search plugins.
      </description>
    </property>

    </configuration>

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
    <!--
     Licensed to the Apache Software Foundation (ASF) under one or more
     contributor license agreements. See the NOTICE file distributed with
     this work for additional information regarding copyright ownership.
     The ASF licenses this file to You under the Apache License, Version 2.0
     (the "License"); you may not use this file except in compliance with
     the License. You may obtain a copy of the License at

         http://www.apache.org/licenses/LICENSE-2.0

     Unless required by applicable law or agreed to in writing, software
     distributed under the License is distributed on an "AS IS" BASIS,
     WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     See the License for the specific language governing permissions and
     limitations under the License.
    -->
    <!-- Do not modify this file directly. Instead, copy entries that you -->
    <!-- wish to modify from this file into nutch-site.xml and change them -->
    <!-- there. If nutch-site.xml does not already exist, create it. -->

    <configuration>

    <!-- file properties -->

    <property>
      <name>file.content.limit</name>
      <value>65536</value>
      <description>The length limit for downloaded content, in bytes.
      If this value is nonnegative (>=0), content longer than it will be truncated;
      otherwise, no truncation at all.
      </description>
    </property>

    <property>
      <name>file.content.ignored</name>
      <value>true</value>
      <description>If true, no file content will be saved during fetch.
      And it is probably what we want to set most of time, since file:// URLs
      are meant to be local and we can always use them directly at parsing
      and indexing stages. Otherwise file contents will be saved.
      !! NO IMPLEMENTED YET !!
      </description>
    </property>

    <!-- HTTP properties -->

    <property>
      <name>http.agent.name</name>
      <value></value>
      <description>HTTP 'User-Agent' request header. MUST NOT be empty -
      please set this to a single word uniquely related to your organization.

      NOTE: You should also check other related properties:

        http.robots.agents
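The description of plugin.includes in the attached configuration says the value is a regular expression over plugin ids and that any plugin not matching it is excluded. A small sketch can show which ids a given value admits. It assumes the expression must match the whole plugin id (an assumption about how Nutch applies it, not something stated in this thread); PluginIncludesCheck is a hypothetical helper, and the value is copied from the nutch-site.xml in this message.

```java
import java.util.regex.Pattern;

// Illustrative check only; how Nutch itself applies plugin.includes is assumed, not quoted.
class PluginIncludesCheck {

    // True if the whole plugin id matches the plugin.includes expression.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Value copied from the nutch-site.xml in this message.
        String includes = "recommended|protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic"
                + "|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)";
        for (String id : new String[] {"recommended", "parse-html", "nutch-extensionpoints"}) {
            System.out.println(id + " -> " + isIncluded(includes, id));
        }
    }
}
```

Under that assumption, a check like this makes it easy to confirm that every plugin id the description asks for is actually covered by the expression.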
Re: Problem compiling plugins
Thank you very much! It worked. I just downloaded ant-trax.jar and added it to the lib directory of my Ant home. Then ant war was successful.

On Fri, May 9, 2008 at 7:00 PM, [EMAIL PROTECTED] wrote:
> Hi,
>
> You are missing some ant jars. I'm not sure which ones, but it looks like the class that cannot be found is TraXLiaison, so once you google you'll find which optional ant jar this is in. Get that jar, put it in your ant home's lib dir and try again.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message -----
> From: Pau [EMAIL PROTECTED]
> To: nutch-dev@lucene.apache.org
> Sent: Friday, May 9, 2008 4:32:08 AM
> Subject: Problem compiling plugins
>
> Hello,
>
> I have to implement a plugin for Nutch 0.9, so I have followed the WritingPluginExample-0.9. When I try to compile the plugins I get warnings about nutch-extensionpoints.jar:
>
>     [jar] Warning: skipping jar archive /home/pau/Pau/Master/Tesis/nutch-0.9/build/nutch-extensionpoints/nutch-extensionpoints.jar because no files were in
>
> Why do I get this warning?
>
> Furthermore, when I try to compile the .war file with the command 'ant war', I get the following error:
>
>     generate-locale:
>          [echo] Generating docs for locale=ca
>          [xslt] java.lang.ClassNotFoundException: org.apache.tools.ant.taskdefs.optional.TraXLiaison
Problem compiling plugins
Hello,

I have to implement a plugin for Nutch 0.9, so I have followed the WritingPluginExample-0.9. When I try to compile the plugins I get warnings about nutch-extensionpoints.jar:

    [jar] Warning: skipping jar archive /home/pau/Pau/Master/Tesis/nutch-0.9/build/nutch-extensionpoints/nutch-extensionpoints.jar because no files were in

Why do I get this warning?

Furthermore, when I try to compile the .war file with the command 'ant war', I get the following error:

    generate-locale:
         [echo] Generating docs for locale=ca
         [xslt] java.lang.ClassNotFoundException: org.apache.tools.ant.taskdefs.optional.TraXLiaison
         [xslt]     at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
         [xslt]     at java.security.AccessController.doPrivileged(Native Method)
         [xslt]     at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
         [xslt]     at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
         [xslt]     at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
         [xslt]     at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
         [xslt]     at java.lang.Class.forName0(Native Method)
         [xslt]     at java.lang.Class.forName(Class.java:169)
         [xslt]     at org.apache.tools.ant.taskdefs.XSLTProcess.loadClass(XSLTProcess.java:548)
         [xslt]     at org.apache.tools.ant.taskdefs.XSLTProcess.resolveProcessor(XSLTProcess.java:533)
         [xslt]     at org.apache.tools.ant.taskdefs.XSLTProcess.getLiaison(XSLTProcess.java:785)
         [xslt]     at org.apache.tools.ant.taskdefs.XSLTProcess.execute(XSLTProcess.java:300)
         [xslt]     at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
         [xslt]     at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
         [xslt]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         [xslt]     at java.lang.reflect.Method.invoke(Method.java:597)
         [xslt]     at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:105)
         [xslt]     at org.apache.tools.ant.Task.perform(Task.java:348)
         [xslt]     at org.apache.tools.ant.Target.execute(Target.java:357)
         [xslt]     at org.apache.tools.ant.Target.performTasks(Target.java:385)
         [xslt]     at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1329)
         [xslt]     at org.apache.tools.ant.helper.SingleCheckExecutor.executeTargets(SingleCheckExecutor.java:38)
         [xslt]     at org.apache.tools.ant.Project.executeTargets(Project.java:1181)
         [xslt]     at org.apache.tools.ant.taskdefs.Ant.execute(Ant.java:416)
         [xslt]     at org.apache.tools.ant.taskdefs.CallTarget.execute(CallTarget.java:105)
         [xslt]     at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:288)
         [xslt]     at sun.reflect.GeneratedMethodAccessor1.invoke(Unknown Source)
         [xslt]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
         [xslt]     at java.lang.reflect.Method.invoke(Method.java:597)
         [xslt]     at org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:105)
         [xslt]     at org.apache.tools.ant.Task.perform(Task.java:348)
         [xslt]     at org.apache.tools.ant.Target.execute(Target.java:357)
         [xslt]     at org.apache.tools.ant.Target.performTasks(Target.java:385)
         [xslt]     at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1329)
         [xslt]     at org.apache.tools.ant.Project.executeTarget(Project.java:1298)
         [xslt]     at org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
         [xslt]     at org.apache.tools.ant.Project.executeTargets(Project.java:1181)
         [xslt]     at org.apache.tools.ant.Main.runBuild(Main.java:698)
         [xslt]     at org.apache.tools.ant.Main.startAnt(Main.java:199)
         [xslt]     at org.apache.tools.ant.launch.Launcher.run(Launcher.java:257)
         [xslt]     at org.apache.tools.ant.launch.Launcher.main(Launcher.java:104)

    BUILD FAILED
    /home/pau/Pau/Master/Tesis/nutch-0.9/build.xml:442: The following error occurred while executing this line:
    /home/pau/Pau/Master/Tesis/nutch-0.9/build.xml:408: java.lang.ClassNotFoundException: org.apache.tools.ant.taskdefs.optional.TraXLiaison

Could you please help me with it?

Thank you very much.