Sorry, a correction:
my crawl-urlfilter.txt contains
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*com/

not
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*la/
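
A quick way to see what each of the two accept lines lets through is to replay the rules in Python. This is only a sketch under assumptions: it assumes the filter applies rules top to bottom and that the first rule whose pattern is found in the URL decides, and the helper names `make_rules` and `accepts` are made up for illustration (they are not part of Nutch).

```python
import re

def make_rules(host_rule):
    """Build a reduced rule list in file order: (sign, pattern)."""
    return [
        ("-", r"^(file|ftp|mailto):"),  # skip file:, ftp:, & mailto: urls
        ("+", host_rule),               # the "accept hosts" line under test
        ("-", r"."),                    # skip everything else
    ]

def accepts(url, rules):
    """True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is dropped

com_rules = make_rules(r"^http://([a-z0-9]*\.)*com/")
la_rules = make_rules(r"^http://([a-z0-9]*\.)*la/")

for url in ("http://www.aaa.com/", "http://www.bbb.com/xxx",
            "http://www.ccc.com/xxx", "www.ccc.com"):
    print(url, "com rule:", accepts(url, com_rules),
          "la rule:", accepts(url, la_rules))
```

In this sketch the `com/` line accepts the three `http://` URLs, the `la/` line accepts none of them, and a bare `www.ccc.com` with no `http://` prefix is rejected by both.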

2009/3/5 Yves Yu <[email protected]>

> Yes.
> Let me summarize my question so that I can express it more clearly.
>
> I put only www.aaa.com in urls.txt.
>
> The page www.aaa.com has a link to www.bbb.com
> www.bbb.com has a link to www.ccc.com
> www.ccc.com contains the word: big
>
> If I search for "big", can I get www.ccc.com as a result?
>
> -------------------------------------------------------------------------------------------
> my urls.txt is
> www.aaa.com
>
> -------------------------------------------------------------------------------------------
> my crawl-urlfilter.txt is
> # skip file:, ftp:, & mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
>
> -\.(js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
>
> # skip URLs containing certain characters as probable queries, etc.
> #-...@=]
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> -.*(/.+?)/.*?\1/.*?\1/
>
> # accept hosts in MY.DOMAIN.NAME
> +^http://([a-z0-9]*\.)*la/
>
> # skip everything else
> -.
>
> -------------------------------------------------------------------------------------------
> my nutch-site.xml contains:
> <property>
> <name>http.agent.url</name>
> <value>http://www.aaa.com/</value>
> <description>http://www.aaa.com/</description>
> </property>
>
> -------------------------------------------------------------------------------------------
> my nutch-default.xml contains the following, and I did not override it in
> my nutch-site.xml
>
> <property>
>   <name>db.update.additions.allowed</name>
>   <value>true</value>
>   <description>If true, updatedb will add newly discovered URLs, if false
>   only already existing URLs in the CrawlDb will be updated and no new
>   URLs will be added.
>   </description>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>false</value>
>   <description>If true, outlinks leading from a page to external hosts
>   will be ignored. This is an effective way to limit the crawl to include
>   only initially injected hosts, without creating complex URLFilters.
>   </description>
> </property>
>
> -------------------------------------------------------------------------------------------
> With these configurations,
> I can crawl pages on www.aaa.com such as www.aaa.com/xxxxx,
> but I cannot crawl pages like www.bbb.com/xxxx or www.ccc.com/xxxx. Is that
> right?
>
> If I want to crawl pages like www.bbb.com/xxx or www.ccc.com/xxx, what
> should I do?
>
> Thank you for taking the time to answer my question.
>
> Yves
>
> 2009/3/5 Alexander Aristov <[email protected]>
>
>> If you have taken Nutch from trunk, then these settings are already there.
>> You should just check them. If they are not, you can add them.
>>
>> See also the nutch-default.xml file, which contains all Nutch settings. You
>> may copy the ones you need from this file and customize them in nutch-site.xml.
>>
>> 2009/3/5 Yves Yu <[email protected]>
>>
>> > Thanks, that would be very useful for me.
>> > Do I just copy these two lines into the nutch-site.xml file?
>> > If not, could you provide a template?
>> >
>> > 2009/3/5 Alexander Aristov <[email protected]>
>> >
>> > > In the nutch-site.xml file, check these settings:
>> > >
>> > > db.update.additions.allowed
>> > > db.ignore.external.links
>> > >
>> > > Alexander
>> > >
>> > >
>> > > 2009/3/5 Yves Yu <[email protected]>
>> > >
>> > > > Thank you.
>> > > > My urls.txt is www.aaa.com.
>> > > > Must I add www.bbb.com and www.ccc.com there?
>> > > >
>> > > > My urlfilter is +^http://([a-z0-9]*\.)*com/
>> > > >
>> > > > By the way, how do I check in the Nutch settings whether adding
>> > > > outward links is allowed?
>> > > >
>> > > > 2009/3/5 Alexander Aristov <[email protected]>
>> > > >
>> > > > > I would suggest checking the URL filters. If you use the crawl
>> > > > > command, then it is the crawl-urlfilter; otherwise it is
>> > > > > regex-urlfilter.
>> > > > >
>> > > > > And check the Nutch settings to see whether you allow adding outward
>> > > > > links.
>> > > > >
>> > > > > 2009/3/5 Yves Yu <[email protected]>
>> > > > >
>> > > > > > Yes, I'm using Luke now, and I see there is no www.bbb.com and no
>> > > > > > www.ccc.com in the crawl. It only crawls www.aaa.com,
>> > > > > > www.aaa.com/xxx/xxx, and the like.
>> > > > > > Do you know what the problem is?
>> > > > > >
>> > > > > > 2009/3/4 Jasper Kamperman <[email protected]>
>> > > > > >
>> > > > > > > Oh, and the documentation also specifies a depth parameter that
>> > > > > > > says how far afield the crawler may go. I think the default is 10,
>> > > > > > > but I'm not sure.
>> > > > > > >
>> > > > > > > Sent from my iPhone
>> > > > > > >
>> > > > > > >
>> > > > > > > On Mar 3, 2009, at 12:53 PM, Yves Yu <[email protected]>
>> wrote:
>> > > > > > >
>> > > > > > >> You mean we can do this without additional configuration? How
>> > > > > > >> about a depth of 10, like this? How can I set it? Thanks.
>> > > > > > >>
>> > > > > > >> 2009/3/4 Jasper Kamperman <[email protected]
>> >
>> > > > > > >>
>> > > > > > >>> Could be a lot of reasons. I'd start by investigating the index
>> > > > > > >>> with Luke to see if ccc made it into the index and if I can search
>> > > > > > >>> out the page with the word "big". From what I find out with Luke
>> > > > > > >>> I'd work my way back to the root cause.
>> > > > > > >>>
>> > > > > > >>> Sent from my iPhone
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>> On Mar 3, 2009, at 7:40 AM, Yves Yu <[email protected]>
>> > wrote:
>> > > > > > >>>
>> > > > > > >>> Hi, all,
>> > > > > > >>>
>> > > > > > >>>> for example,
>> > > > > > >>>>
>> > > > > > >>>> The page www.aaa.com has a link www.bbb.com
>> > > > > > >>>> www.bbb.com has a link www.ccc.com
>> > > > > > >>>> www.ccc.com has a word: big
>> > > > > > >>>>
>> > > > > > >>>> It seems I cannot find "big" in www.ccc.com. Is it possible? How
>> > > > > > >>>> should I set the configuration?
>> > > > > > >>>>
>> > > > > > >>>> Thanks in advance!
>> > > > > > >>>>
>> > > > > > >>>> Yves
>> > > > > > >>>>
>> > > > > > >>>>
>> > > > > > >>>
>> > > > > >
>> > > > >
>> > > > >
>> > > > >
>> > > > > --
>> > > > > Best Regards
>> > > > > Alexander Aristov
>> > > > >
>> > > >
>> > >
>> > >
>> > >
>> > > --
>> > > Best Regards
>> > > Alexander Aristov
>> > >
>> >
>>
>>
>>
>> --
>> Best Regards
>> Alexander Aristov
>>
>
>
