Hi,

On 6/29/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> I have tried the NUTCH-444 "feed" plugin to enable spidering of RSS feeds:
> /nutch-2007-06-27_06-52-44/plugins/feed
> (that's a recent nightly build of nutch).
>
> When I attempt a crawl I get an IOException:
>
> $ nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2
> crawl started in: /usr/tmp/lee_apollo
> rootUrlDir = /usr/tmp/lee_urls.txt
> threads = 10
> depth = 2
> Injector: starting
> Injector: crawlDb: /usr/tmp/lee_apollo/crawldb
> Injector: urlDir: /usr/tmp/lee_urls.txt
> Injector: Converting injected urls to crawl db entries.
> Exception in thread "main" java.io.IOException: Job failed!
>         at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>         at org.apache.nutch.crawl.Injector.inject(Injector.java:162)
>         at org.apache.nutch.crawl.Crawl.main(Crawl.java:115)
> 3.14 real 1.92 user 0.30 sys
This stack trace is not useful; it is just the JobTracker (or LocalJobRunner) reporting back that the job has failed. If you are running in a distributed environment, check your tasktracker logs; if you are running locally, look at logs/hadoop.log for the underlying exception.
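For example, in a local run something along these lines (run from the directory you launched the crawl in) usually surfaces the real exception behind the "Job failed!" message:

$ tail -n 100 logs/hadoop.log
$ grep -B 2 -A 20 "Exception" logs/hadoop.log

The stack trace you find there should show what actually breaks when the feed plugin is enabled.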
> The seed URL is:
> http://www.mt-olympus.com/apollo/feed/
>
> I enabled the feed plugin via this property in nutch-site.xml:
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|feed</value>
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
> As a sanity check, when I take out "feed" from <value> above, it no longer
> throws an exception (but it also doesn't fetch anything):
>
> $ nutch crawl /usr/tmp/lee_urls.txt -dir /usr/tmp/lee_apollo -depth 2 2>&1 | tee crawl.log
> crawl started in: /usr/tmp/lee_apollo
> rootUrlDir = /usr/tmp/lee_urls.txt
> threads = 10
> depth = 2
> Injector: starting
> Injector: crawlDb: /usr/tmp/lee_apollo/crawldb
> Injector: urlDir: /usr/tmp/lee_urls.txt
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: /usr/tmp/lee_apollo/segments/20070628155854
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls by host, for politeness.
> Generator: done.
> Fetcher: starting
> Fetcher: segment: /usr/tmp/lee_apollo/segments/20070628155854
> Fetcher: threads: 10
> fetching http://www.mt-olympus.com/apollo/feed/
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: /usr/tmp/lee_apollo/crawldb
> CrawlDb update: segments: [/usr/tmp/lee_apollo/segments/20070628155854]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> Generator: Selecting best-scoring urls due for fetch.
> Generator: starting
> Generator: segment: /usr/tmp/lee_apollo/segments/20070628155907
> Generator: filtering: false
> Generator: topN: 2147483647
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: 0 records selected for fetching, exiting ...
> Stopping at depth=1 - no more URLs to fetch.
> LinkDb: starting
> LinkDb: linkdb: /usr/tmp/lee_apollo/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: /usr/tmp/lee_apollo/segments/20070628155854
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: /usr/tmp/lee_apollo/linkdb
> Indexer: adding segment: /usr/tmp/lee_apollo/segments/20070628155854
> Indexing [http://www.mt-olympus.com/apollo/feed/] with analyzer [EMAIL PROTECTED] (null)
> Optimizing index.
> merging segments _ram_0 (1 docs) into _0 (1 docs)
> [EMAIL PROTECTED] Thread-36: now checkpoint "segments_2" [isCommit = true]
> [EMAIL PROTECTED] Thread-36: IncRef "_0.fnm": pre-incr count is 0
> [EMAIL PROTECTED] Thread-36: IncRef "_0.fdx": pre-incr count is 0
> [EMAIL PROTECTED] Thread-36: IncRef "_0.fdt": pre-incr count is 0
> [EMAIL PROTECTED] Thread-36: IncRef "_0.tii": pre-incr count is 0
> [EMAIL PROTECTED] Thread-36: IncRef "_0.tis": pre-incr count is 0
> [EMAIL PROTECTED] Thread-36: IncRef "_0.frq": pre-incr count is 0
> [EMAIL PROTECTED] Thread-36: IncRef "_0.prx": pre-incr count is 0
> [EMAIL PROTECTED] Thread-36: IncRef "_0.nrm": pre-incr count is 0
> [EMAIL PROTECTED] Thread-36: deleteCommits: now remove commit "segments_1"
> [EMAIL PROTECTED] Thread-36: DecRef "segments_1": pre-decr count is 1
> [EMAIL PROTECTED] Thread-36: delete "segments_1"
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: /usr/tmp/lee_apollo/indexes
> Dedup: done
> merging indexes to: /usr/tmp/lee_apollo/index
> Adding /usr/tmp/lee_apollo/indexes/part-00000
> done merging
> crawl finished: /usr/tmp/lee_apollo
> 30.45 real 8.40 user 2.26 sys
>
> ----- Original Message ----
> From: Doğacan Güney <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Wednesday, June 27, 2007 10:59:52 PM
> Subject: Re: Possibly use a different library to parse RSS feed for improved performance and compatibility
>
> On 6/28/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> > I am choosing to use NUTCH-444 for my RSS functionality. Doğacan commented
> > on how to do this; he wrote:
> > ...if you need the functionality of NUTCH-444, I would suggest
> > trying a nightly version of Nutch, because NUTCH-444 by itself is not
> > enough. You also need two patches from NUTCH-443 and probably
> > NUTCH-504.
> >
> > I have a couple of newbie questions about the mechanics of installing this.
> >
> > Prefatory comments: I have already installed another patch (for NUTCH-505),
> > so I think I already have a nightly build (I'm guessing trunk == nightly?).
> > These were the steps I did:
> > $ svn co http://svn.apache.org/repos/asf/lucene/nutch/trunk nutch
> > $ cd nutch
> > $ wget https://issues.apache.org/jira/secure/attachment/12360411/NUTCH-505_draft_v2.patch
> > $ patch -p0 < NUTCH-505_draft_v2.patch
> > $ ant clean && ant
> >
> > ---
> >
> > Now I need NUTCH-443, NUTCH-504, and NUTCH-444. Here's my guess:
> >
> > $ cd nutch
> > $ wget http://issues.apache.org/jira/secure/attachment/12359953/NUTCH_443_reopened_v3.patch
> > $ patch -p0 < NUTCH_443_reopened_v3.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12350644/parse-map-core-draft-v1.patch
> > $ patch -p0 < parse-map-core-draft-v1.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12350634/parse-map-core-untested.patch
> > $ patch -p0 < parse-map-core-untested.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12357183/redirect_and_index.patch
> > $ patch -p0 < redirect_and_index.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12357300/redirect_and_index_v2.patch
> > $ patch -p0 < redirect_and_index_v2.patch
> >
> > I'm really guessing on the above ... continuing:
> >
> > $ wget http://issues.apache.org/jira/secure/attachment/12360361/NUTCH-504_v2.patch
> > $ patch -p0 < NUTCH-504_v2.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12360348/parse_in_fetchers.patch
> > $ patch -p0 < parse_in_fetchers.patch
> >
> > ... that felt like less of a guess, but now:
> >
> > $ wget http://issues.apache.org/jira/secure/attachment/12357192/NUTCH-444.patch
> > $ patch -p0 < NUTCH-444.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12350820/parse-feed.tar.bz2
> > $ tar xjvf parse-feed.tar.bz2
> >
> > What do I do with this newly created parse-feed directory?
> >
> > So then I would do:
> >
> > $ ant clean && ant
> >
> > Wait a minute: do I have this whole thing wrong? Maybe Doğacan means that
> > the nightly builds ALREADY contain NUTCH-443 and NUTCH-504, so that I would
> > do this:
> >
> > $ wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz
> > $ tar xvzf nutch-2007-06-27_06-52-44.tar.gz
> > $ cd nutch-2007-06-27_06-52-44
> >
> > then this business:
> >
> > $ wget http://issues.apache.org/jira/secure/attachment/12357192/NUTCH-444.patch
> > $ patch -p0 < NUTCH-444.patch
> > $ wget http://issues.apache.org/jira/secure/attachment/12350820/parse-feed.tar.bz2
> > $ tar xjvf parse-feed.tar.bz2
> >
> > What do I do with this newly created parse-feed directory?
> >
> > So then I would do:
> >
> > $ ant clean && ant
> >
> > I guess this is why "release engineer" is a job in and of itself!
> > Please advise.

> If you downloaded the nightly build of 27 June, it already contains the
> feed plugin (the plugin is called "feed", not "parse-feed"; parse-feed was
> an older plugin and was never committed. In my earlier comment I meant to
> write parse-rss but wrote parse-feed). So you don't have to apply any
> patches or anything. Just download a recent nightly build, and you are
> good to go :).
>
> You can also check out trunk from svn and that will work too.
>
> > --Kai Middleton
> >
> > ----- Original Message ----
> > From: Doğacan Güney <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > Sent: Friday, June 22, 2007 1:39:12 AM
> > Subject: Re: Possibly use a different library to parse RSS feed for improved performance and compatibility
> >
> > On 6/21/07, Kai_testing Middleton <[EMAIL PROTECTED]> wrote:
> > > I am a new nutch user and the ability to crawl RSS feeds is critical to
> > > my mission. Do I understand from this (lengthy) discussion that in order
> > > to get the new RSS support I need to either a) download one of the nightly
> > > builds and run ant, or b) download and apply a patch (NUTCH-444.patch, I gather)?
> >
> > Nutch 0.9 can already parse RSS feeds (via the parse-feed plugin).
> > However, if you need the functionality of NUTCH-444, I would suggest
> > trying a nightly version of Nutch, because NUTCH-444 by itself is not
> > enough. You also need two patches from NUTCH-443 and probably
> > NUTCH-504. If you are worried about stability, nightlies of nutch are
> > generally pretty stable.
> >
> > --
> > Doğacan Güney
>
> --
> Doğacan Güney
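By the way, in case it helps to see the whole nightly route in one place, here is a rough, untested sketch (the tarball is the 27 June build you already found, so substitute whatever the latest nightly is; urls.txt and crawl_out are just placeholder names):

$ wget http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/lastStableBuild/artifact/trunk/build/nutch-2007-06-27_06-52-44.tar.gz
$ tar xvzf nutch-2007-06-27_06-52-44.tar.gz
$ cd nutch-2007-06-27_06-52-44
$ # add "feed" to the plugin.includes value in conf/nutch-site.xml, as in your property above
$ bin/nutch crawl urls.txt -dir crawl_out -depth 2

With the prebuilt tarball there should be no patch or ant step at all.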
--
Doğacan Güney
