Re: Crawling (better: indexing) only certain URLS

Andrea Gazzarini Fri, 15 Apr 2016 12:24:06 -0700

Hi Furkan,
many thanks, I'm going to try it and I'll let you know.

For the first question, I'm not sure about the overall size, but we're
talking about 2 million (and growing) pages; in general, nothing that can be
easily handled with a custom, from-scratch solution.


I was wondering if, from a functional perspective, Nutch is a good fit for
automating the periodic indexing (in Solr, this is my ultimate goal) of
that website. If that works, the same mechanism will be used for other
websites as well.

Best,
Andrea
On 15 Apr 2016 18:16, "Furkan KAMACI" <furkankam...@gmail.com> wrote:

> Hi Andrea,
>
> Regex URL Filter works like this:
>
> This rule accepts anything else:
>
> +.
>
> Let's assume that you want to crawl Nutch's website. If you wished to limit
> the crawl to the nutch.apache.org domain, then the definition should be:
>
> +^http://([a-z0-9]*\.)*nutch.apache.org/
>
> So, if the links in your "more like this" section follow this pattern:
>
> http://www.xyz.com/book/{book_id}
>
> Then your definition should be:
>
> +^http://www.xyz.com/book/([0-9]*\.)*
>
> For your first question, please tell us the approximate size of the data
> you will crawl, and whether you have any other requirements.
>
> Kind Regards,
> Furkan KAMACI
>
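
As a rough way to experiment with the rules quoted above outside Nutch, here
is a minimal Python sketch. It assumes the filter tries the rules from top to
bottom, lets the first matching rule decide (`+` accepts, `-` rejects), drops
a URL that no rule matches, and treats a pattern as unanchored unless it
starts with `^`. The names `parse_rules` and `accepts` are made up for this
illustration, and Python's `re` module only approximates Java regex syntax.

```python
import re

def parse_rules(text):
    """Turn regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError("rule must start with + or -: " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """First matching rule decides; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# The book rule suggested in this thread, used as the only rule.
RULES = parse_rules(r"""
# keep only the book details pages of the example site
+^http://www.xyz.com/book/([0-9]*\.)*
""")

for url in ("http://www.xyz.com/book/1",
            "http://www.xyz.com/book/42",
            "http://www.xyz.com/search?q=nutch",
            "http://www.example.org/book/1"):
    print(url, accepts(RULES, url))
```

Under these assumptions the two book URLs are accepted and the other two are
rejected. Adding a final `+.` line would accept everything that falls through
the earlier rules, so it should be left out when only book pages are wanted.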
>
> On Fri, Apr 15, 2016 at 4:17 PM, Andrea Gazzarini <gxs...@gmail.com>
> wrote:
>
> > Hi guys,
> > just playing as a Nutch newbie in a simple (at least I think) use case:
> >
> > I have a website (e.g. http://www.xyz.com) that allows searching for
> > books. Here, as on any plain search website, I have two kinds of pages:
> >
> >  * a page that shows search results (depending on the user entered
> >    search terms)
> >  * a details page about a given book. Each details page is a permalink
> >    which follows a given naming convention (e.g.
> >    http://www.xyz.com/book/{book id})
> >
> > The details page has something like a "more like this" section that
> > contains permalinks to other (similar) books.
> > Now, my requirement is to index in Solr *all* details pages of that
> > website.
> >
> > If Nutch is a suitable tool for doing that (and this is actually the first
> > question), could you please give me some hint about how to configure it?
> >
> > Specifically, I tried a seed file with just one entry
> >
> > http://www.xyz.com/book/1
> >
> > and then I configured my regex-urlfilter.txt
> >
> > +^http://www.xyz.com/book
> >
> > But it indexes only the /1 page. I imagined that the "more like this"
> > section of the /1 page would act as a set of outlinks for getting further
> > details pages (where in turn there are further MLT sections, and so on).
> >
> > Best,
> > Andrea
> >
> >
>
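
One thing that might explain seeing only the /1 page (a guess, nothing in the
thread confirms it) is that outlinks discovered in one crawl round are only
fetched in a following round, so a single round stops at the seed. The toy
model below illustrates that idea in Python. It is not Nutch code: the link
graph `LINKS`, the pattern `ACCEPT` and the function `crawl` are all
hypothetical.

```python
import re

# Hypothetical link graph: page -> links found in its "more like this" section.
LINKS = {
    "http://www.xyz.com/book/1": ["http://www.xyz.com/book/2",
                                  "http://www.xyz.com/book/3",
                                  "http://www.xyz.com/search?q=x"],
    "http://www.xyz.com/book/2": ["http://www.xyz.com/book/4"],
    "http://www.xyz.com/book/3": ["http://www.xyz.com/book/1"],
    "http://www.xyz.com/book/4": [],
}

# Stand-in for a URL filter rule that keeps only book details pages.
ACCEPT = re.compile(r"^http://www\.xyz\.com/book/")

def crawl(seeds, rounds):
    """Fetch round by round; links found in round N are fetched in round N+1."""
    fetched = []            # pages fetched so far, in fetch order
    known = set(seeds)      # every URL already scheduled or fetched
    frontier = list(seeds)  # URLs due in the current round
    for _ in range(rounds):
        next_frontier = []
        for url in frontier:
            fetched.append(url)
            for link in LINKS.get(url, []):
                if ACCEPT.search(link) and link not in known:
                    known.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return fetched

seed = ["http://www.xyz.com/book/1"]
print(len(crawl(seed, 1)))  # only the seed is fetched
print(len(crawl(seed, 3)))  # all four book pages are fetched
```

In this model the search page is never fetched because the filter rejects it,
and the number of rounds bounds how far the crawl gets from the seed.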
