Hi Leo,

From the times both the fetching and parsing took, I suspect that maybe
Nutch didn't actually fetch the URL; however, this may not be the case, as I
have nothing to benchmark it against. Unfortunately, on this occasion the URL
http://wiki.apache.org actually redirects to http://wiki.apache.org/general/,
so I'm going to post my log output from the last URL you specified in an
attempt to clear this one up. The following confirms that you are accurate in
your observations: not only does this produce invalid segments, but nothing
is fetched in the process.

Therefore the reason we are getting the 'skipping invalid segment' message
is that we are not actually fetching any content. My initial thought was
that your urlfilters were not set properly, and I think that this is part of
the problem.

Please follow the syntax very carefully and it will work perfectly for you,
as follows:

regex-urlfilter.txt
--------------------------

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# crawl URLs in the following domains.
+^http://([a-z0-9]*\.)*seek.com.au/

# accept anything else
#+.

seed file
----------------------
http://www.seek.com.au

It sounds really trivial, but I think that the trailing '/' in your seed
file may have been making all of the difference.
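
Before kicking off another crawl it is worth checking that the seed URL
actually passes your filters. Something along the following lines should do
it from runtime/local (a sketch from memory, so double-check the checker
class name against your version):

# feed a URL to the combined filter chain on stdin
echo 'http://www.seek.com.au/' | bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined

A URL that passes every active filter should be echoed back prefixed with
'+', a rejected one with '-'. If the seed comes back with '-', generate
selects nothing and you end up with exactly the empty segments we have been
seeing.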

Please try, test with readdb and readseg and comment back.
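
By that I mean checks roughly like the following (paths assume you run from
runtime/local and crawl into a 'crawl' directory as above; for a healthy
segment, readseg -list should report non-zero fetched and parsed counts):

bin/nutch readdb crawl/crawldb -stats
bin/nutch readseg -list -dir crawl/segments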

Sorry for the delayed posts on this one; I have not had much time to get to
it. Hope all goes to plan. Evidence can be seen below.

lewis@lewis-01:~/ASF/branch-1.4/runtime/local$ bin/nutch readdb crawldb
-stats
CrawlDb statistics start: crawldb
Statistics for CrawlDb: crawldb
TOTAL urls:    48
retry 0:    48
min score:    0.017
avg score:    0.041125
max score:    1.175
status 1 (db_unfetched):    47
status 2 (db_fetched):    1
CrawlDb statistics: done
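
If you want to be doubly sure that content actually came down the wire,
dumping the segment also works. A minimal sketch, with <segment> standing in
for your segment's timestamp directory and 'segdump' just my choice of
output directory:

bin/nutch readseg -dump crawl/segments/<segment> segdump
less segdump/dump

The dump should show the fetch status, content and parse text for each URL
in the segment; if those sections are missing, nothing was fetched.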





On Thu, Jul 21, 2011 at 3:30 AM, Leo Subscriptions <llsub...@zudiewiener.com> wrote:

> Following are the suggested commands and the results. I left the redirect
> setting at 0, as 'crawl' works without any issues. The problem only occurs
> when running the individual commands.
>
> ------- nutch-site.xml -------------------------------
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>  <name>http.agent.name</name>
>  <value>listers spider</value>
> </property>
>
> <property>
>  <name>fetcher.verbose</name>
>  <value>true</value>
>  <description>If true, fetcher will log more verbosely.</description>
> </property>
>
> <property>
>  <name>http.verbose</name>
>  <value>true</value>
>  <description>If true, HTTP will log more verbosely.</description>
> </property>
>
> </configuration>
> ---------------------------------------------------------------
>
> ------ Individual commands and results-------------------------
>
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed/urls
> Injector: starting at 2011-07-21 12:24:52
> Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> Injector: urlDir: /home/llist/nutchData/seed/urls
> Injector: Converting injected urls to crawl db entries.
> Injector: Merging injected urls into crawl db.
> Injector: finished at 2011-07-21 12:24:55, elapsed: 00:00:02
>
>
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> generate /home/llist/nutchData/crawl/crawldb
> /home/llist/nutchData/crawl/segments -topN 100
> Generator: starting at 2011-07-21 12:25:16
> Generator: Selecting best-scoring urls due for fetch.
> Generator: filtering: true
> Generator: normalizing: true
> Generator: topN: 100
> Generator: jobtracker is 'local', generating exactly one partition.
> Generator: Partitioning selected urls for politeness.
> Generator: segment: /home/llist/nutchData/crawl/segments/20110721122519
> Generator: finished at 2011-07-21 12:25:20, elapsed: 00:00:03
>
>
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> fetch /home/llist/nutchData/crawl/segments/20110721122519
> Fetcher: Your 'http.agent.name' value should be listed first in
> 'http.robots.agents' property.
> Fetcher: starting at 2011-07-21 12:26:36
> Fetcher: segment: /home/llist/nutchData/crawl/segments/20110721122519
> Fetcher: threads: 10
> QueueFeeder finished: total 1 records + hit by time limit :0
> -finishing thread FetcherThread, activeThreads=1
> fetching http://wiki.apache.org/
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -finishing thread FetcherThread, activeThreads=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> -activeThreads=0
> Fetcher: finished at 2011-07-21 12:26:40, elapsed: 00:00:04
>
>
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> parse /home/llist/nutchData/crawl/segments/20110721122519
> ParseSegment: starting at 2011-07-21 12:27:22
> ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110721122519
> ParseSegment: finished at 2011-07-21 12:27:24, elapsed: 00:00:01
>
>
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> updatedb /home/llist/nutchData/crawl/crawldb
> -dir /home/llist/nutchData/crawl/segments/20110721122519
> CrawlDb update: starting at 2011-07-21 12:28:03
> CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> CrawlDb update: segments:
> [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text,
> file:/home/llist/nutchData/crawl/segments/20110721122519/content,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse,
> file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: false
> CrawlDb update: URL filtering: false
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/content
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch
>  - skipping invalid segment
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate
> CrawlDb update: Merging segment data into db.
> CrawlDb update: finished at 2011-07-21 12:28:04, elapsed: 00:00:01
>
>
> ------------------------------------------------------------------------------------
>
>
>
> On Wed, 2011-07-20 at 21:58 +0100, lewis john mcgibbney wrote:
>
> > There is no documentation for the individual commands used to run a Nutch
> > 1.3 crawl, so I'm not sure where anyone has been misled. In the instance
> > that this was required, I would direct newer users to the legacy
> > documentation for the time being.
> >
> > My comment to Leo was to understand whether he managed to correct the
> > invalid segments problem.
> >
> > Leo, if this still persists, may I ask you to try again? I will do the
> > same and will be happy to provide feedback.
> >
> > May I suggest the following
> >
> >
> > use the following commands
> >
> > inject
> > generate
> > fetch
> > parse
> > updatedb
> >
> > At this stage we should be able to ascertain if something is incorrect and
> > hopefully debug. May I add the following... please make the following
> > additions to nutch-site:
> >
> > fetcher verbose - true
> > http verbose - true
> > check for redirects and set accordingly
> >
> >
> > On Wed, Jul 20, 2011 at 1:39 PM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> >
> > > The wiki can be edited and you are welcome to suggest improvements if
> > > there is something missing.
> > >
> > > On 20 July 2011 13:31, Cam Bazz <camb...@gmail.com> wrote:
> > >
> > > > Hello,
> > > >
> > > > I think the documentation is misleading; it does not tell us that we
> > > > have to parse.
> > > >
> > > > On Wed, Jul 20, 2011 at 11:42 AM, Julien Nioche <lists.digitalpeb...@gmail.com> wrote:
> > > > > Haven't you forgotten to call parse?
> > > > >
> > > > > On 19 July 2011 23:40, Leo Subscriptions <llsub...@zudiewiener.com> wrote:
> > > > >
> > > > >> Hi Lewis,
> > > > >>
> > > > >> You are correct about the last post not showing any errors. I just
> > > > >> wanted to show that I don't get any errors if I use 'crawl' and to
> > > > >> prove that I do not have any faults in the conf files or the
> > > > >> directories.
> > > > >>
> > > > >> I still get the errors if I use the individual commands inject,
> > > > >> generate, fetch....
> > > > >>
> > > > >> Cheers,
> > > > >>
> > > > >> Leo
> > > > >>
> > > > >>
> > > > >>
> > > > >>  On Tue, 2011-07-19 at 22:09 +0100, lewis john mcgibbney wrote:
> > > > >>
> > > > >> > Hi Leo
> > > > >> >
> > > > >> > Did you resolve?
> > > > >> >
> > > > >> > Your second log data doesn't appear to show any errors; however,
> > > > >> > the problem you specify is one I have witnessed myself a while
> > > > >> > ago. Since you posted, have you been able to replicate... or
> > > > >> > resolve?
> > > > >> >
> > > > >> >
> > > > >> > On Sun, Jul 17, 2011 at 1:03 AM, Leo Subscriptions
> > > > >> > <llsub...@zudiewiener.com> wrote:
> > > > >> >
> > > > >> >         I've used crawl to ensure the config is correct and I
> > > > >> >         don't get any errors, so I must be doing something wrong
> > > > >> >         with the individual steps, but can't see what.
> > > > >> >
> > > > >> >
> > > > >> > --------------------------------------------------------------------------------------------------------------------
> > > > >> >
> > > > >> >         llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> > > > >> >         crawl /home/llist/nutchData/seed/urls -dir /home/llist/nutchData/crawl
> > > > >> >         -depth 3 -topN 5
> > > > >> >         solrUrl is not set, indexing will be skipped...
> > > > >> >         crawl started in: /home/llist/nutchData/crawl
> > > > >> >         rootUrlDir = /home/llist/nutchData/seed/urls
> > > > >> >         threads = 10
> > > > >> >         depth = 3
> > > > >> >         solrUrl=null
> > > > >> >         topN = 5
> > > > >> >         Injector: starting at 2011-07-17 09:31:19
> > > > >> >
> > > > >> >         Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > > >> >
> > > > >> >
> > > > >> >         Injector: urlDir: /home/llist/nutchData/seed/urls
> > > > >> >
> > > > >> >         Injector: Converting injected urls to crawl db entries.
> > > > >> >         Injector: Merging injected urls into crawl db.
> > > > >> >
> > > > >> >
> > > > >> >         Injector: finished at 2011-07-17 09:31:22, elapsed: 00:00:02
> > > > >> >         Generator: starting at 2011-07-17 09:31:22
> > > > >> >
> > > > >> >         Generator: Selecting best-scoring urls due for fetch.
> > > > >> >         Generator: filtering: true
> > > > >> >         Generator: normalizing: true
> > > > >> >
> > > > >> >
> > > > >> >         Generator: topN: 5
> > > > >> >
> > > > >> >         Generator: jobtracker is 'local', generating exactly one
> > > > >> >         partition.
> > > > >> >         Generator: Partitioning selected urls for politeness.
> > > > >> >
> > > > >> >
> > > > >> >         Generator: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > >> >         Generator: finished at 2011-07-17 09:31:26, elapsed: 00:00:04
> > > > >> >
> > > > >> >         Fetcher: Your 'http.agent.name' value should be listed
> > > > >> >         first in 'http.robots.agents' property.
> > > > >> >
> > > > >> >
> > > > >> >         Fetcher: starting at 2011-07-17 09:31:26
> > > > >> >         Fetcher: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > >> >
> > > > >> >         Fetcher: threads: 10
> > > > >> >         QueueFeeder finished: total 1 records + hit by time limit :0
> > > > >> >         fetching http://www.seek.com.au/
> > > > >> >         -finishing thread FetcherThread, activeThreads=1
> > > > >> >         -finishing thread FetcherThread, activeThreads=1
> > > > >> >         -finishing thread FetcherThread, activeThreads=1
> > > > >> >         -finishing thread FetcherThread, activeThreads=1
> > > > >> >         -finishing thread FetcherThread, activeThreads=1
> > > > >> >         -finishing thread FetcherThread, activeThreads=1
> > > > >> >         -finishing thread FetcherThread, activeThreads=1
> > > > >> >         -finishing thread FetcherThread, activeThreads=1
> > > > >> >
> > > > >> >         -finishing thread FetcherThread, activeThreads=1
> > > > >> >         -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > > >> >         -finishing thread FetcherThread, activeThreads=0
> > > > >> >         -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > >> >         -activeThreads=0
> > > > >> >
> > > > >> >
> > > > >> >         Fetcher: finished at 2011-07-17 09:31:29, elapsed: 00:00:03
> > > > >> >         ParseSegment: starting at 2011-07-17 09:31:29
> > > > >> >         ParseSegment: segment: /home/llist/nutchData/crawl/segments/20110717093124
> > > > >> >         ParseSegment: finished at 2011-07-17 09:31:32, elapsed: 00:00:02
> > > > >> >         CrawlDb update: starting at 2011-07-17 09:31:32
> > > > >> >
> > > > >> >         CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > >> >         CrawlDb update: segments:
> > > > >> >
> > > > >> >
> > > > >> >         [/home/llist/nutchData/crawl/segments/20110717093124]
> > > > >> >
> > > > >> >         CrawlDb update: additions allowed: true
> > > > >> >
> > > > >> >
> > > > >> >         CrawlDb update: URL normalizing: true
> > > > >> >         CrawlDb update: URL filtering: true
> > > > >> >
> > > > >> >         CrawlDb update: Merging segment data into db.
> > > > >> >
> > > > >> >
> > > > >> >         CrawlDb update: finished at 2011-07-17 09:31:34, elapsed: 00:00:02
> > > > >> >         :
> > > > >> >         :
> > > > >> >         :
> > > > >> >         :
> > > > >> >
> > > > >> > -----------------------------------------------------------------------------------------------
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >         On Sat, 2011-07-16 at 12:14 +1000, Leo Subscriptions wrote:
> > > > >> >
> > > > >> >         > Done, but now get additional errors:
> > > > >> >         >
> > > > >> >         > -------------------
> > > > >> >         > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> > > > >> >         > updatedb /home/llist/nutchData/crawl/crawldb
> > > > >> >         > -dir /home/llist/nutchData/crawl/segments/20110716105826
> > > > >> >         > CrawlDb update: starting at 2011-07-16 11:03:56
> > > > >> >         > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > >> >         > CrawlDb update: segments:
> > > > >> >         > [file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch,
> > > > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/content,
> > > > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse,
> > > > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data,
> > > > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate,
> > > > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text]
> > > > >> >         > CrawlDb update: additions allowed: true
> > > > >> >         > CrawlDb update: URL normalizing: false
> > > > >> >         > CrawlDb update: URL filtering: false
> > > > >> >         >  - skipping invalid segment
> > > > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_fetch
> > > > >> >         >  - skipping invalid segment
> > > > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/content
> > > > >> >         >  - skipping invalid segment
> > > > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_parse
> > > > >> >         >  - skipping invalid segment
> > > > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_data
> > > > >> >         >  - skipping invalid segment
> > > > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/crawl_generate
> > > > >> >         >  - skipping invalid segment
> > > > >> >         > file:/home/llist/nutchData/crawl/segments/20110716105826/parse_text
> > > > >> >         > CrawlDb update: Merging segment data into db.
> > > > >> >         > CrawlDb update: finished at 2011-07-16 11:03:57, elapsed: 00:00:01
> > > > >> >         > -------------------------------------------
> > > > >> >         >
> > > > >> >         > On Sat, 2011-07-16 at 02:36 +0200, Markus Jelsma wrote:
> > > > >> >         >
> > > > >> >         > > fetch, then parse.
> > > > >> >         > >
> > > > >> >         > > > I'm running nutch 1.3 on 64-bit Ubuntu; following
> > > > >> >         > > > are the commands and relevant output.
> > > > >> >         > > >
> > > > >> >         > > > ----------------------------------
> > > > >> >         > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > >> >         > > > inject /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/seed
> > > > >> >         > > > Injector: starting at 2011-07-15 18:32:10
> > > > >> >         > > > Injector: crawlDb: /home/llist/nutchData/crawl/crawldb
> > > > >> >         > > > Injector: urlDir: /home/llist/nutchData/seed
> > > > >> >         > > > Injector: Converting injected urls to crawl db entries.
> > > > >> >         > > > Injector: Merging injected urls into crawl db.
> > > > >> >         > > > Injector: finished at 2011-07-15 18:32:13, elapsed: 00:00:02
> > > > >> >         > > > =================
> > > > >> >         > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > >> >         > > > generate /home/llist/nutchData/crawl/crawldb /home/llist/nutchData/crawl/segments
> > > > >> >         > > > Generator: starting at 2011-07-15 18:32:41
> > > > >> >         > > > Generator: Selecting best-scoring urls due for fetch.
> > > > >> >         > > > Generator: filtering: true
> > > > >> >         > > > Generator: normalizing: true
> > > > >> >         > > > Generator: jobtracker is 'local', generating exactly one partition.
> > > > >> >         > > > Generator: Partitioning selected urls for politeness.
> > > > >> >         > > > Generator: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > >> >         > > > Generator: finished at 2011-07-15 18:32:45, elapsed: 00:00:03
> > > > >> >         > > > ==================
> > > > >> >         > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > >> >         > > > fetch /home/llist/nutchData/crawl/segments/20110715183244
> > > > >> >         > > > Fetcher: Your 'http.agent.name' value should be
> > > > >> >         > > > listed first in 'http.robots.agents' property.
> > > > >> >         > > > Fetcher: starting at 2011-07-15 18:34:55
> > > > >> >         > > > Fetcher: segment: /home/llist/nutchData/crawl/segments/20110715183244
> > > > >> >         > > > Fetcher: threads: 10
> > > > >> >         > > > QueueFeeder finished: total 1 records + hit by time limit :0
> > > > >> >         > > > fetching http://www.seek.com.au/
> > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > > > >> >         > > > -finishing thread FetcherThread, activeThreads=2
> > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > > > >> >         > > > -finishing thread FetcherThread, activeThreads=1
> > > > >> >         > > > -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
> > > > >> >         > > > -finishing thread FetcherThread, activeThreads=0
> > > > >> >         > > > -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
> > > > >> >         > > > -activeThreads=0
> > > > >> >         > > > Fetcher: finished at 2011-07-15 18:34:59, elapsed: 00:00:03
> > > > >> >         > > > =================
> > > > >> >         > > > llist@LeosLinux:~$ /usr/share/nutch/runtime/local/bin/nutch
> > > > >> >         > > > updatedb /home/llist/nutchData/crawl/crawldb
> > > > >> >         > > > -dir /home/llist/nutchData/crawl/segments/20110715183244
> > > > >> >         > > > CrawlDb update: starting at 2011-07-15 18:36:00
> > > > >> >         > > > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> > > > >> >         > > > CrawlDb update: segments:
> > > > >> >         > > > [file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch,
> > > > >> >         > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate,
> > > > >> >         > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content]
> > > > >> >         > > > CrawlDb update: additions allowed: true
> > > > >> >         > > > CrawlDb update: URL normalizing: false
> > > > >> >         > > > CrawlDb update: URL filtering: false
> > > > >> >         > > > - skipping invalid segment
> > > > >> >         > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_fetch
> > > > >> >         > > > - skipping invalid segment
> > > > >> >         > > > file:/home/llist/nutchData/crawl/segments/20110715183244/crawl_generate
> > > > >> >         > > > - skipping invalid segment
> > > > >> >         > > > file:/home/llist/nutchData/crawl/segments/20110715183244/content
> > > > >> >         > > > CrawlDb update: Merging segment data into db.
> > > > >> >         > > > CrawlDb update: finished at 2011-07-15 18:36:01, elapsed: 00:00:01
> > > > >> >         > > > -----------------------------------
> > > > >> >         > > >
> > > > >> >         > > > Appreciate any hints on what I'm missing.
> > > > >> >         >
> > > > >> >         >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> >
> > > > >> > --
> > > > >> > Lewis
> > > > >> >
> > > > >>
> > > > >>
> > > > >>
> > > > >
> > > > >
> > > > > --
> > > > > *Open Source Solutions for Text Engineering*
> > > > >
> > > > > http://digitalpebble.blogspot.com/
> > > > > http://www.digitalpebble.com
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > *Open Source Solutions for Text Engineering*
> > >
> > > http://digitalpebble.blogspot.com/
> > > http://www.digitalpebble.com
> > >
> >
> >
> >
>
>
>


-- 
*Lewis*
