There is no documentation for the individual commands used to run a Nutch 1.3
crawl, so I'm not sure where the documentation is misleading. If such
documentation is needed, I would direct newer users to the legacy
documentation for the time being.

My comment to Leo was to find out whether he had managed to correct the
invalid-segments problem.

Leo, if this still persists, may I ask you to try again? I will do the same
and will be happy to provide feedback.

May I suggest the following. Run the individual commands in this order
(a minimal sketch of the full sequence follows the list):

inject
generate
fetch
parse
updatedb
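
Something like this, reusing the paths from your earlier logs (the segment
timestamp shown is just the one from your last run; use whatever generate
prints):

  NUTCH=/usr/share/nutch/runtime/local/bin/nutch
  $NUTCH inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
  $NUTCH generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments -topN 5
  # use the segment directory that generate printed
  SEG=/home/llist/nutchData/crawl/segments/20110716105826
  $NUTCH fetch $SEG
  $NUTCH parse $SEG
  # pass the segment itself to updatedb rather than -dir <segment>: -dir
  # expects the parent segments/ directory and treats every subdirectory
  # of its argument as a segment, which is exactly what produces the
  # "skipping invalid segment .../crawl_fetch" messages in your log
  $NUTCH updatedb /home/llist/nutchData/crawl/crawldb $SEG

If that runs cleanly, the invalid-segment messages were most likely down to
how the segment was passed to updatedb rather than the segment data itself.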

At this stage we should be able to ascertain whether anything is going wrong
and hopefully debug it. May I also ask you to make the following additions to
nutch-site.xml (a sketch of the entries follows the list):

fetcher.verbose - true
http.verbose - true
check http.redirect.max and set it according to how you want redirects handled
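
For reference, here is a sketch of the corresponding nutch-site.xml entries;
the property names are the standard ones from nutch-default.xml, and the
http.redirect.max value is only an example:

  <!-- added to conf/nutch-site.xml -->
  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
    <!-- makes the fetcher log each step verbosely -->
  </property>
  <property>
    <name>http.verbose</name>
    <value>true</value>
    <!-- makes the http plugin log requests/responses verbosely -->
  </property>
  <property>
    <name>http.redirect.max</name>
    <value>3</value>
    <!-- max redirects followed during a fetch; 0 or negative records
         redirects for a later round instead of following immediately -->
  </property>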


On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche <
lists.digitalpeb...@gmail.com> wrote:

> The wiki can be edited and you are welcome to suggest improvements if there
> is something missing
>
> On 20 July 2011 13:31, Cam Bazz <camb...@gmail.com> wrote:
>
> > Hello,
> >
> > I think the documentation is misleading; it does not tell us
> > that we have to parse.
> >
> > On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche
> > <lists.digitalpeb...@gmail.com> wrote:
> > > Haven't you forgotten to call parse?
> > >
> > > On 19 July 2011 23:40, Leo Subscriptions <llsub...@zudiewiener.com>
> > wrote:
> > >
> > >> Hi Lewis,
> > >>
> > >> You are correct about the last post not showing any errors. I just
> > >> wanted to show that I don't get any errors if I use 'crawl' and to
> > >> prove that I do not have any faults in the conf files or the directories.
> > >>
> > >> I still get the errors if I use the individual commands inject,
> > >> generate, fetch....
> > >>
> > >> Cheers,
> > >>
> > >> Leo
> > >>
> > >>
> > >>
> > >>  On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
> > >>
> > >> > Hi Leo
> > >> >
> > >> > Did you resolve?
> > >> >
> > >> > Your second log data doesn't appear to show any errors; however, the
> > >> > problem you describe is one I witnessed myself a while ago. Since
> > >> > you posted, have you been able to replicate it... or resolve it?
> > >> >
> > >> >
> > >> > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions
> > >> > <llsub...@zudiewiener.com> wrote:
> > >> >
> > >> >         I've used crawl to ensure config is correct and I don't get
> > >> >         any errors, so I must be doing something wrong with the
> > >> >         individual steps, but can't see what.
> > >> >
> > >> >
> > >>
> > >> > --------------------------------------------------------------------
> > >> >
> > >> >         llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> > >> >           crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl
> > >> >           -depth 3 -topN 5
> > >> >         solrUrl is not set, indexing will be skipped...
> > >> >         crawl started in: /home/llist/nutchData/crawl
> > >> >         rootUrlDir = /home/llist/nutchData/seed/urls
> > >> >         threads = 10
> > >> >         depth = 3
> > >> >         solrUrl=null
> > >> >         topN = 5
> > >> >         Injector: starting at 2011-07-17 09:31:19
> > >> >
> > >> >         Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > >> >
> > >> >
> > >> >         Injector: urlDir: /home/llist/nutchData/seed/urls
> > >> >
> > >> >         Injector: Converting injected urls to crawl db entries.
> > >> >         Injector: Merging injected urls into crawl db.
> > >> >
> > >> >
> > >> >         Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> > >> >         Generator: starting at 2011-07-17 09:31:22
> > >> >
> > >> >         Generator: Selecting best-scoring urls due for fetch.
> > >> >         Generator: filtering: true
> > >> >         Generator: normalizing: true
> > >> >
> > >> >
> > >> >         Generator: topN: 5
> > >> >
> > >> >         Generator: jobtracker is 'local', generating exactly one
> > >> >         partition.
> > >> >         Generator: Partitioning selected urls for politeness.
> > >> >
> > >> >
> > >> >         Generator:
> > >> >         segment: /home/llist/nutchData/crawl/segments/20110717093124
> > >> >         Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
> > >> >
> > >> >         Fetcher: Your 'http.agent.name' value should be listed
> > >> >         first in 'http.robots.agents' property.
> > >> >
> > >> >
> > >> >         Fetcher: starting at 2011-07-17 09:31:26
> > >> >         Fetcher:
> > >> >         segment: /home/llist/nutchData/crawl/segments/20110717093124
> > >> >
> > >> >         Fetcher: threads: 10
> > >> >         QueueFeeder finished: total 1 records + hit by time limit :0
> > >> >         fetching http://www.seek.com.au/
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >
> > >> >         -finishing thread FetcherThread, activeThreads=1
> > >> >         -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > >> >         -finishing thread FetcherThread, activeThreads=0
> > >> >         -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > >> >         -activeThreads=0
> > >> >
> > >> >
> > >> >         Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> > >> >         ParseSegment: starting at 2011-07-17 09:31:29
> > >> >         ParseSegment:
> > >> >         segment: /home/llist/nutchData/crawl/segments/20110717093124
> > >> >         ParseSegment: finished at 2011-07-17 09:31:32, elapsed:
> > >> >         00:00:02
> > >> >         CrawlDb update: starting at 2011-07-17 09:31:32
> > >> >
> > >> >         CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > >> >         CrawlDb update: segments:
> > >> >
> > >> >
> > >> >         [/home/llist/nutchData/crawl/segments/20110717093124]
> > >> >
> > >> >         CrawlDb update: additions allowed: true
> > >> >
> > >> >
> > >> >         CrawlDb update: URL normalizing: true
> > >> >         CrawlDb update: URL filtering: true
> > >> >
> > >> >         CrawlDb update: Merging segment data into db.
> > >> >
> > >> >
> > >> >         CrawlDb update: finished at 2011-07-17 09:31:34, elapsed:
> > >> >         00:00:02
> > >> >         :
> > >> >         :
> > >> >         :
> > >> >         :
> > >> >
> > >>
> > >> > -----------------------------------------------------------
> > >> >
> > >> >
> > >> >
> > >> >         On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
> > >> >
> > >> >         > Done, but now get additional errors:
> > >> >         >
> > >> >         > -------------------
> > >> >         > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> > >> >         > updatedb /home/llist/nutchData/crawl/crawldb
> > >> >         > -dir /home/llist/nutchData/crawl/segments/20110716105826
> > >> >         > CrawlDb update: starting at 2011-07-16 11:03:56
> > >> >         > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > >> >         > CrawlDb update: segments:
> > >> >         > [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
> > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/content,
> > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
> > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
> > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
> > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > >> >         > CrawlDb update: additions allowed: true
> > >> >         > CrawlDb update: URL normalizing: false
> > >> >         > CrawlDb update: URL filtering: false
> > >> >         >  - skipping invalid segment
> > >> >         >    file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> > >> >         >  - skipping invalid segment
> > >> >         >    file:/home/llist/nutchData/crawl/segments/20110716105826/content
> > >> >         >  - skipping invalid segment
> > >> >         >    file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> > >> >         >  - skipping invalid segment
> > >> >         >    file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> > >> >         >  - skipping invalid segment
> > >> >         >    file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> > >> >         >  - skipping invalid segment
> > >> >         >    file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > >> >         > CrawlDb update: Merging segment data into db.
> > >> >         > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> > >> >         > -------------------------------------------
> > >> >         >
> > >> >         > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> > >> >         >
> > >> >         > > fetch, then parse.
> > >> >         > >
> > >> >         > > > I'm running nutch 1.3 on 64 bit Ubuntu, following are
> > >> >         > > > the commands and relevant output.
> > >> >         > > >
> > >> >         > > > ----------------------------------
> > >> >         > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > >> >         > > >   inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > >> >         > > > Injector: starting at 2011-07-15 18:32:10
> > >> >         > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > >> >         > > > Injector: urlDir: /home/llist/nutchData/seed
> > >> >         > > > Injector: Converting injected urls to crawl db entries.
> > >> >         > > > Injector: Merging injected urls into crawl db.
> > >> >         > > > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > >> >         > > > =================
> > >> >         > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > >> >         > > >   generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments
> > >> >         > > > Generator: starting at 2011-07-15 18:32:41
> > >> >         > > > Generator: Selecting best-scoring urls due for fetch.
> > >> >         > > > Generator: filtering: true
> > >> >         > > > Generator: normalizing: true
> > >> >         > > > Generator: jobtracker is 'local', generating exactly one partition.
> > >> >         > > > Generator: Partitioning selected urls for politeness.
> > >> >         > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > >> >         > > > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > >> >         > > > ==================
> > >> >         > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > >> >         > > >   fetch /home/llist/nutchData/crawl/segments/20110715183244
> > >> >         > > > Fetcher: Your 'http.agent.name' value should be listed
> > >> >         > > > first in 'http.robots.agents' property.
> > >> >         > > > Fetcher: starting at 2011-07-15 18:34:55
> > >> >         > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > >> >         > > > Fetcher: threads: 10
> > >> >         > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > >> >         > > > fetching http://www.seek.com.au/
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=2
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > >> >         > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > >> >         > > > -finishing thread FetcherThread, activeThreads=0
> > >> >         > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > >> >         > > > -activeThreads=0
> > >> >         > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > >> >         > > > =================
> > >> >         > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > >> >         > > >   updatedb /home/llist/nutchData/crawl/crawldb
> > >> >         > > >   -dir /home/llist/nutchData/crawl/segments/20110715183244
> > >> >         > > > CrawlDb update: starting at 2011-07-15 18:36:00
> > >> >         > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > >> >         > > > CrawlDb update: segments:
> > >> >         > > > [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
> > >> >         > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
> > >> >         > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > >> >         > > > CrawlDb update: additions allowed: true
> > >> >         > > > CrawlDb update: URL normalizing: false
> > >> >         > > > CrawlDb update: URL filtering: false
> > >> >         > > > - skipping invalid segment
> > >> >         > > >   file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > >> >         > > > - skipping invalid segment
> > >> >         > > >   file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > >> >         > > > - skipping invalid segment
> > >> >         > > >   file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > >> >         > > > CrawlDb update: Merging segment data into db.
> > >> >         > > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > >> >         > > > -----------------------------------
> > >> >         > > >
> > >> >         > > > Appreciate any hints on what I'm missing.
> > >> >         >
> > >> >         >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > --
> > >> > Lewis
> > >> >
> > >>
> > >>
> > >>
> > >
> > >
> > > --
> > > *
> > > *Open Source Solutions for Text Engineering
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > >
> >
>
>
>
> --
> *
> *Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>



-- 
*Lewis*
