Re: Hi
Did you check crawl-urlfilter.txt? All the domain names that you'd like to
crawl have to be mentioned there, e.g.:

# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*mersin\.edu\.tr/
+^http://([a-z0-9]*\.)*tubitak\.gov\.tr/

Also check the property db.ignore.external.links in nutch-default.xml. It
should be set to false.

2010/5/5 Zehra Göçer
>
> I have a problem with Nutch. My project is link analysis. I crawled
> www.mersin.edu.tr and analysed the linkdb, and I can see all the links
> within mersin.edu.tr. But I also need the links that point to other sites,
> for example www.tubitak.gov.tr, and I cannot find them. How can I get
> these links? Please help me.
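For reference, a sketch of the property override mentioned above; it would go
in conf/nutch-site.xml so that it takes precedence over nutch-default.xml (the
description text below is illustrative wording, not the stock file):

  <property>
    <name>db.ignore.external.links</name>
    <value>false</value>
    <description>If false, outlinks to hosts outside the seed domains are
    kept, so cross-site links (e.g. mersin.edu.tr -> tubitak.gov.tr) can
    appear in the linkdb, provided the URL filters also accept them.
    </description>
  </property>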
Re: Format of the Nutch Results
I think you need to specify the individual segment, e.g.:

bin/nutch readseg -dump crawl-20100420112025/segments/20100422092816 dumpSegmentDirectory

On Wed, Apr 21, 2010 at 9:38 PM, nachonieto3 wrote:
>
> Thank you a lot! Now I'm working on that, but I have some more doubts... I'm
> not able to run the readseg command. I've been consulting some help forums,
> and the basic syntax is
>
> readseg
>
> I have the segments in this path:
>
> D:\nutch-0.9\crawl-20100420112025\segments
>
> The directory named crawl-20100420112025 is the one where the segments are
> stored. So I'm trying to execute the command with these, but none of them
> works:
>
> readseg d/nutch-0.9/crawl-20100420112025/segments
> readseg crawl-20100420112025/segments
> readseg crawl-20100420112025
>
> What am I doing wrong?? When I try to execute it I get "bash: readseg:
> command not found".
> Any idea?? Thank you in advance.
> --
> View this message in context:
> http://n3.nabble.com/Format-of-the-Nutch-Results-tp729918p739952.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
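For reference, a sketch of the suggested invocation (the segment timestamp and
output directory are placeholders; readseg is a sub-command of the bin/nutch
script, not a standalone binary, which is why bash reports "command not found"
when it is typed on its own):

  # run from the Nutch installation directory
  bin/nutch readseg -dump crawl-20100420112025/segments/20100422092816 dump_dir

  # the segment content is written as plain text into the output directory
  less dump_dir/dump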
Re: AbstractMethodError for cyberneko parser
Thanks Julien. I have changed nutch-site.xml to have only parse-(tika) instead
of parse-(text | html | js | tika) in the plugin.includes property. It works
now, as it doesn't pick up any other parser besides Tika.

On Wed, Apr 21, 2010 at 7:42 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:

> Hi Harry,
>
> Could you try using parse-tika instead and see if you are getting the same
> problem? I gather from your email that you are using Nutch 1.1 or the SVN
> version, so parse-tika should be used by default. Have you deactivated it?
>
> Thanks
>
> Julien
>
> On 21 April 2010 11:58, Harry Nutch wrote:
>
> > Replacing the current xercesimpl.jar with the one from Nutch 1.0 seems to
> > fix the problem.
> >
> > On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch wrote:
> >
> > > Hi,
> > >
> > > I am running the latest version of Nutch. While crawling one particular
> > > site I get an AbstractMethodError in the cyberneko plugin for all of its
> > > pages when doing a fetch.
> > > As I understand, this happens because of a difference between the
> > > runtime and compile-time versions. However, I am running it afresh
> > > after an ant clean.
> > >
> > > Any suggestions would be helpful. Btw, I am using Java version
> > > "1.6.0_18" on a Windows environment.
> > >
> > > [java.lang.AbstractMethodError stack trace snipped; the full trace is
> > > in the original message at the end of this thread]
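For reference, a sketch of the plugin.includes change described above, placed
in conf/nutch-site.xml. The non-parser plugins listed here are only
illustrative; copy the value your nutch-default.xml actually ships with and
narrow just the parse-* group to parse-tika:

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-tika|index-(basic|anchor)|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>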
Re: AbstractMethodError for cyberneko parser
Replacing the current xercesimpl.jar with the one from Nutch 1.0 seems to fix
the problem.

On Wed, Apr 21, 2010 at 3:14 PM, Harry Nutch wrote:

> Hi,
>
> I am running the latest version of Nutch. While crawling one particular
> site I get an AbstractMethodError in the cyberneko plugin for all of its
> pages when doing a fetch.
> As I understand, this happens because of a difference between the runtime
> and compile-time versions. However, I am running it afresh after an ant
> clean.
>
> Any suggestions would be helpful. Btw, I am using Java version "1.6.0_18"
> on a Windows environment.
>
> [java.lang.AbstractMethodError stack trace snipped; the full trace is in
> the original message below]
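For reference, a sketch of how one might locate and swap the jar as described
above (directory layout and jar names are placeholders; check what your two
installations actually contain before copying anything):

  # list the Xerces/NekoHTML jars each release bundles, including plugin copies
  find nutch-1.1 \( -name 'xercesImpl*.jar' -o -name 'nekohtml*.jar' \)
  find nutch-1.0 -name 'xercesImpl*.jar'

  # back up the 1.1 jar, then drop in the 1.0 jar that is known to work
  cp nutch-1.1/lib/xercesImpl.jar nutch-1.1/lib/xercesImpl.jar.bak
  cp nutch-1.0/lib/xercesImpl.jar nutch-1.1/lib/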
AbstractMethodError for cyberneko parser
Hi,

I am running the latest version of Nutch. While crawling one particular site I
get an AbstractMethodError in the cyberneko plugin for all of its pages when
doing a fetch. As I understand, this happens because of a difference between
the runtime and compile-time versions. However, I am running it afresh after
an ant clean.

Any suggestions would be helpful. Btw, I am using Java version "1.6.0_18" on a
Windows environment.

The same trace is reported for every page:

java.lang.AbstractMethodError: org.cyberneko.html.HTMLScanner.getCharacterOffset()I
        at org.apache.xerces.xni.parser.XMLParseException.<init>(Unknown Source)
        at org.cyberneko.html.HTMLConfiguration$ErrorReporter.createException(HTMLConfiguration.java:673)
        at org.cyberneko.html.HTMLConfiguration$ErrorReporter.reportError(HTMLConfiguration.java:662)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2404)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanAttribute(HTMLScanner.java:2360)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scanStartElement(HTMLScanner.java:2267)
        at org.cyberneko.html.HTMLScanner$ContentScanner.scan(HTMLScanner.java:1820)
        at org.cyberneko.html.HTMLScanner.scanDocument(HTMLScanner.java:789)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:478)
        at org.cyberneko.html.HTMLConfiguration.parse(HTMLConfiguration.java:431)
        at org.cyberneko.html.parsers.DOMFragmentParser.parse(DOMFragmentParser.java:164)
        at org.apache.nutch.parse.html.HtmlParser.parseNeko(HtmlParser.java:249)
        at org.apache.nutch.parse.html.HtmlParser.parse(HtmlParser.java:212)
        at org.apache.nutch.parse.html.HtmlParser.getParse(HtmlParser.java:145)
        at org.apache.nutch.parse.ParseUtil.parse(ParseUtil.java:82)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.output(Fetcher.java:879)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:646)
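For reference, a sketch of one way to confirm the mismatch the error points at:
check whether the NekoHTML jar actually on the runtime classpath declares the
getCharacterOffset method the parser was compiled against (the jar path and
version are placeholders for whatever lib/ contains):

  # dump the method signatures of HTMLScanner from the bundled jar
  javap -classpath lib/nekohtml-0.9.5.jar org.cyberneko.html.HTMLScanner | grep getCharacterOffset

  # no output means the method is missing, i.e. an older NekoHTML/Xerces
  # combination is shadowing the one the code was compiled against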
Re: nutch says No URLs to fetch - check your seed list and URL filters when trying to index fmforums.com
Did you check robots.txt?

On Wed, Apr 21, 2010 at 7:57 AM, joshua paul wrote:

> After getting this email, I tried commenting out this line in
> regex-urlfilter.txt:
>
> #-[...@=]
>
> but it didn't help... I still get the same message - no URLs to fetch.
>
> regex-urlfilter.txt:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[...@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> # accept anything else
> +.
>
> crawl-urlfilter.txt:
>
> # skip URLs containing certain characters as probable queries, etc.
> # we don't want to skip
> #-[...@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> +^http://([a-z0-9]*\.)*fmforums.com/
>
> # skip everything else
> -.
>
> arkadi.kosmy...@csiro.au wrote on 2010-04-20 4:49 PM:
>
>> What is in your regex-urlfilter.txt?
>>
>>> -----Original Message-----
>>> From: joshua paul [mailto:jos...@neocodesoftware.com]
>>> Sent: Wednesday, 21 April 2010 9:44 AM
>>> To: nutch-user@lucene.apache.org
>>> Subject: nutch says No URLs to fetch - check your seed list and URL
>>> filters when trying to index fmforums.com
>>>
>>> Nutch says "No URLs to fetch - check your seed list and URL filters" when
>>> trying to index fmforums.com.
>>>
>>> I am using this command:
>>>
>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>
>>> - the urls directory contains urls.txt, which contains
>>>   http://www.fmforums.com/
>>> - crawl-urlfilter.txt contains +^http://([a-z0-9]*\.)*fmforums.com/
>>>
>>> Note - my Nutch setup indexes other sites fine.
>>>
>>> For example, I am using this command:
>>>
>>> bin/nutch crawl urls -dir crawl -depth 3 -topN 50
>>>
>>> - the urls directory contains urls.txt, which contains
>>>   http://dispatch.neocodesoftware.com
>>> - crawl-urlfilter.txt contains
>>>   +^http://([a-z0-9]*\.)*dispatch.neocodesoftware.com/
>>>
>>> And Nutch generates a good crawl.
>>>
>>> How can I troubleshoot why Nutch says "No URLs to fetch"?
>
> --
> catching falling stars...
>
> https://www.linkedin.com/in/joshuascottpaul
> MSN coga...@hotmail.com AOL neocodesoftware
> Yahoo joshuascottpaul Skype neocodesoftware
> Toll Free 1.888.748.0668 Fax 1-866-336-7246
> #238 - 425 Carrall St YVR BC V6B 6E3 CANADA
>
> www.neocodesoftware.com store.neocodesoftware.com
> www.monicapark.ca www.digitalpostercenter.com
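For reference, two quick checks along the lines suggested above (the URL is
taken from the thread; the URLFilterChecker class name and its flag are an
assumption about this Nutch version, so confirm them with bin/nutch first):

  # 1. see whether robots.txt blocks the crawler from the site
  curl -s http://www.fmforums.com/robots.txt

  # 2. run the seed URL through the configured URL filters; a leading '+' in
  #    the output means accepted, '-' means rejected
  echo "http://www.fmforums.com/" | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined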
Re: Format of the Nutch Results
Try bin/nutch on the console. It will give you a list of commands. You can use
them to read segments, e.g. bin/nutch readdb ...

On Mon, Apr 19, 2010 at 11:36 PM, nachonieto3 wrote:
>
> I have a doubt... How are the final results of Nutch stored? I mean, in
> which format is the information contained in the analysed links stored?
>
> I understood that Nutch needs the information in plain text to parse it...
> but in which format is it finally stored? I know it is stored in
> "segments", but how can I access this information in order to convert it
> to plain text? Is that possible?
>
> Thank you in advance
>
> --
> View this message in context:
> http://n3.nabble.com/Format-of-the-Nutch-Results-tp729918p729918.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
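For reference, a sketch of the read commands this refers to (the crawl
directory name and output directories are placeholders; each command prints
its exact options when run without arguments):

  bin/nutch readdb crawl/crawldb -stats                 # summary statistics of the crawl db
  bin/nutch readdb crawl/crawldb -dump crawldb_dump     # dump crawl db entries as text
  bin/nutch readlinkdb crawl/linkdb -dump linkdb_dump   # dump inlink data as text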
Re: nutch 1.1 crawl d/n complete issue
I am new to Nutch and still trying to figure out the code flow. However, as a
workaround to issue #1, after the crawl finishes you could run the linkdb and
index commands separately from Cygwin:

$ bin/nutch invertlinks crawl/linkdb -dir crawl/segments
$ bin/nutch index crawl/indexes crawl/crawldb/ crawl/linkdb crawl/segments/20100415163946 crawl/segments/20100415164106

This seems to work for me. You may have already tried this workaround, but
just in case.

-Harry

On Fri, Apr 16, 2010 at 3:34 AM, matthew a. grisius wrote:

> Two observations using the nutch 1.1 nightly build
> nutch-2010-04-14_04-00-47:
>
> 1) Previously I was using nutch 1.0 to crawl successfully, but had
> problems w/ parse-pdf. I decided to try nutch 1.1 w/ parse-tika, which
> appears to parse all of the 'problem' pdfs that parse-pdf could not
> handle. The crawldb and segments directories are created and appear to
> be valid. However, the overall crawl does not finish now:
>
> nutch crawl urls/urls -dir crawl -depth 10
> ...
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20100415015102]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: filtering: true
> Generator: normalizing: true
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Exception in thread "main" java.lang.NullPointerException
>        at org.apache.nutch.crawl.Crawl.main(Crawl.java:133)
>
> Nutch 1.0 would complete like this:
>
> nutch crawl urls/urls -dir crawl -depth 10
> ...
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=7 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225731
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225644
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225749
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225808
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225713
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225937
> LinkDb: adding segment:
> file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/segments/20100414225656
> LinkDb: done
> Indexer: starting
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Dedup: done
> merging indexes to: crawl/index
> Adding file:/home/mag/Desktop/nutch/nutch-1.0/bin/crawl/indexes/part-0
> done merging
> crawl finished: crawl
>
> Any ideas?
>
> 2) If there is a 'space' in any component dir then $NUTCH_OPTS is
> invalid and causes this problem:
>
> m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin> nutch
> crawl urls/urls -dir crawl -depth 10 -topN 10
> NUTCH_OPTS: -Dhadoop.log.dir=/home/mag/Desktop/untitled
> folder/nutch-2010-04-14_04-00-47/logs -Dhadoop.log.file=hadoop.log
> -Djava.library.path=/home/mag/Desktop/untitled
> folder/nutch-2010-04-14_04-00-47/lib/native/Linux-i386-32
> Exception in thread "main" java.lang.NoClassDefFoundError:
> folder/nutch-2010-04-14_04-00-47/logs
> Caused by: java.lang.ClassNotFoundException:
> folder.nutch-2010-04-14_04-00-47.logs
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
>        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
> Could not find the main class: folder/nutch-2010-04-14_04-00-47/logs.
> Program will exit.
> m...@fp:~/Desktop/untitled folder/nutch-2010-04-14_04-00-47/bin>
>
> Obviously the workaround is to rename 'untitled folder' to
> 'untitledFolderWithNoSpaces'.
>
> Thanks, any help w/b appreciated w/ issue #1 above.
>
> -m.
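For reference, a minimal sketch of the shell word-splitting behind issue #2
(the path is taken from the error output above; this only illustrates the
mechanism, not the actual bin/nutch script):

  LOG_DIR="/home/mag/Desktop/untitled folder/nutch-2010-04-14_04-00-47/logs"
  NUTCH_OPTS="-Dhadoop.log.dir=$LOG_DIR"

  # unquoted expansion splits on the space, so java receives two arguments and
  # treats "folder/nutch-2010-04-14_04-00-47/logs" as the main class name
  printf '[%s]\n' $NUTCH_OPTS

  # quoting the expansion keeps the -D option together as a single argument
  printf '[%s]\n' "$NUTCH_OPTS"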