Hi Alex, 

Good day. Sorry to interrupt you again.

I found two websites:
http://svn.macosforge.org/repository/macports/ 
http://svn.collab.net/repos/svn/

When I use Nutch to crawl them, I get:
Generator: 0 records selected for fetching, exiting ...
Stopping at depth=0 - no more URLs to fetch.

I have configured nutch-site.xml and crawl-urlfilter.txt.
Since I can crawl http://svn.apache.org/repos/asf/lucene/nutch/ , I assume
my configuration is ok. Do you think so?
I just want to make sure the problem is not with my Nutch configuration.
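
For reference, here is roughly what I think the relevant part of my
crawl-urlfilter.txt should look like for those two hosts. This is only a
sketch based on the default filter file; the exact patterns below are my
assumption, not a copy of my real file:

  # skip file:, ftp: and mailto: urls
  -^(file|ftp|mailto):

  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]

  # accept the two svn hosts, reject everything else
  +^http://svn\.macosforge\.org/
  +^http://svn\.collab\.net/
  -.

The filter uses the first matching rule, so if a catch-all "-." came before
the "+" lines, or if the host patterns did not match, the generator would
select 0 records exactly like above.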

Thanks.
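
PS: I will also double-check robots.txt on the two new hosts before blaming
my configuration, e.g. with something like:

  wget -q -O - http://svn.macosforge.org/robots.txt
  wget -q -O - http://svn.collab.net/robots.txt

As discussed below, a 404 there is fine; a "Disallow: /" would explain the
empty fetch list.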

-----Original Message-----
From: Alexander Aristov [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 11 November 2008 11:07 PM
To: nutch-user@lucene.apache.org
Subject: Re: Does anybody know how to let nutch crawl this kind of website?

No, you do not. Forget about it then; Nutch should crawl such sites without
any problems. So the problem is with something else.

Alexander

2008/11/11 Windflying <[EMAIL PROTECTED]>

> No, it is "404 Not Found" for http://svn.smartlabs.com/robots.txt.
> Do I need to add one? Sorry for my silly questions.
>
> Thanks.
>
> -----Original Message-----
> From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 11 November 2008 10:41 PM
> To: nutch-user@lucene.apache.org
> Subject: Re: Does anybody know how to let nutch crawl this kind of
website?
>
> The robots.txt file is available at this address
>
> http://your_host/robots.txt
>
> for example : http://svn.apache.org/robots.txt
>
> Check it, and if the file is like you wrote, then it's not surprising that
> Nutch doesn't crawl your svn.
>
> Alexander
>
>
> 2008/11/11 Windflying <[EMAIL PROTECTED]>
>
> > I guess we don't have a robots.txt in svn. I only found this file in the
> > folder /usr/share/Nagios/, as follows:
> >   "User-agent: *
> >    Disallow: /"
> >
> > What's this file for?
> >
> > -----Original Message-----
> > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > Sent: Tuesday, 11 November 2008 4:50 PM
> > To: nutch-user@lucene.apache.org
> > Subject: Re: Does anybody know how to let nutch crawl this kind of
> website?
> >
> >  I don't know how to configure your svn and add XSLT. But if your svn can
> > be viewed from a browser then it should always be crawled by Nutch. One
> > note, does your svn have the robots.txt file? Nutch is polite to public
> > resources and respects their rules. Check whether the file exists and
> > allows robots.
> >
> > Are you using intranet crawling or internet? There are differences in
> > configuration.
> >
> > Alexander
> >
> > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> >
> > > Hi Alex,
> > > Thanks for your reply. :)
> > >
> > > Yes, you are right. I just tried to search
> > > http://svn.apache.org/repos/asf/lucene/nutch/, and it did work.
> > >
> > > But I still can not search my own svn repository site.
> > > Generator: 0 records selected for fetching, exiting...
> > > Stopping at depth=0 - no more URLs to fetch.
> > > Authentication is not a problem. I already used the https-client plugin.
> > > Some resources stored in this svn repository are also referenced by
> > > another intranet website, and they all can be searched and indexed from
> > > that website.
> > >
> > > I am new here. What I was told is that in the case of my company svn, the
> > > xml files are just file/folder names; most of the useful stuff in the svn
> > > is just referenced by the xml. What the XML Stylesheet does is turn the
> > > XML into HTML so browsers can follow the links.
> > >
> > > I guess there must be some difference between the Nutch SVN and my
> > > company SVN, which I do not know yet.
> > >
> > > Thanks & best regards,
> > >
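
[For context: I believe our repository is served through Apache mod_dav_svn,
and the HTML-looking listing comes from the SVNIndexXSLT directive, roughly as
in the sketch below. The paths are made up; only the directive names are real:

  <Location /repos>
    DAV svn
    SVNParentPath /var/svn
    SVNIndexXSLT "/svnindex.xsl"
  </Location>

Without SVNIndexXSLT the directory listings are served as plain XML, which I
suspect is part of the difference between our svn and the Apache one.]
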
> > > -----Original Message-----
> > > From: Alexander Aristov [mailto:[EMAIL PROTECTED]
> > > Sent: Tuesday, 11 November 2008 3:33 PM
> > > To: nutch-user@lucene.apache.org
> > > Subject: Re: Does anybody know how to let nutch crawl this kind of
> > website?
> > >
> > > This should work in the same way as for other sites. Folders are regular
> > > links. If you are talking about parsing content (files in the repository)
> > > then you should have the necessary parsers, for example the text parser,
> > > the xml parser ...
> > >
> > > And you should give anonymous access to svn or configure Nutch to sign
> > > in.
> > >
> > > Alexander
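
[For reference, the plugin list in my nutch-site.xml currently looks roughly
like the sketch below; the exact value differs between Nutch versions, so
treat it as an approximation rather than a copy of my file:

  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

protocol-httpclient is included for the https access mentioned above.]
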
> > >
> > > 2008/11/11 Windflying <[EMAIL PROTECTED]>
> > >
> > > > Hi all,
> > > >
> > > > My company intranet website is an svn repository, similar to:
> > > > http://svn.apache.org/repos/asf/lucene/nutch/ .
> > > >
> > > > Does anybody have an idea on how to let Nutch search it?
> > > >
> > > >
> > > >
> > > > Thanks.
> > > >
> > > >
> > > >
> > > > Bryan
> > > >
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > >
> > > --
> > > Best Regards
> > > Alexander Aristov
> > >
> > >
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >
> >
>
>
> --
> Best Regards
> Alexander Aristov
>
>


-- 
Best Regards
Alexander Aristov
