Hi Furkan,
I'm not able to get it working; maybe I misunderstood your email.

To simplify, let's assume my website has the following structure:

http://www.xyz.com/book/1

which contains links to

http://www.xyz.com/book/2
http://www.xyz.com/book/3

The /2 and /3 pages also contain some outlinks, so the site map looks like this:


   - http://www.xyz.com/book/1
   - http://www.xyz.com/book/2
      - http://www.xyz.com/book/5
         - http://www.xyz.com/book/6
         - http://www.xyz.com/book/3
      - http://www.xyz.com/book/7
         - http://www.xyz.com/book/8

I put

http://www.xyz.com/book/1

in the seed file, and the following line in regex-urlfilter.txt (the only
uncommented line):

+^http://www.xyz.com/book/([0-9]*\.)
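
As a quick sanity check, the pattern can also be tried outside Nutch with
grep (assuming grep -E is a reasonable approximation of the java.util.regex
syntax that Nutch's RegexURLFilter uses):

# prints nothing: "([0-9]*\.)" requires a literal dot after the digits,
# so http://www.xyz.com/book/2 does not match
echo "http://www.xyz.com/book/2" | grep -E '^http://www.xyz.com/book/([0-9]*\.)'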

Running

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/woozlee
urls/few/captain-gazza.txt TestCrawl x
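
(If I read the 1.11 bin/crawl script correctly, its usage is roughly the
following -- a sketch from memory, so it may be slightly off:

bin/crawl [-i|--index] [-D "key=value"] <seed-dir> <crawl-dir> <num-rounds>

i.e. -i asks the script to index into Solr, and x is the number of crawl
rounds.)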


Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 1

...

Indexing 1 documents
Indexer: number of documents indexed, deleted, or skipped:
Indexer:      1  indexed (add/update)
Indexer: finished at 2016-04-16 07:54:35, elapsed: 00:00:04

Cleaning up index if possible
/home/solr/apache-nutch-1.11/bin/nutch clean -Dsolr.server.url=http://localhost:8983/solr/woozlee TestCrawl/crawldb

Sat Apr 16 07:54:39 CEST 2016 : Iteration 2 of 5

Generating a new segment
/home/solr/apache-nutch-1.11/bin/nutch generate -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D mapreduce.map.output.compress=true TestCrawl/crawldb TestCrawl/segments -topN 50000 -numFetchers 1 -noFilter

Generator: starting at 2016-04-16 07:54:40
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: false
Generator: normalizing: true
Generator: topN: 50000
Generator: 0 records selected for fetching, exiting ...

Generate returned 1 (no new segments created)
Escaping loop: no more URLs to fetch now


Whatever value I use for *x*, the cycle completes in a few seconds and
indexes only the URL in the seed list (i.e. I end up with exactly one
record indexed in Solr).
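
For what it's worth, my plan to dig further is to inspect the crawl with
the standard Nutch tools (a sketch; paths assume the TestCrawl directory
above):

# print counts and states of the URLs currently known to the crawldb
bin/nutch readdb TestCrawl/crawldb -stats

# fetch and parse a single page, printing the outlinks Nutch extracts
bin/nutch parsechecker http://www.xyz.com/book/1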


Again, many thanks for your help


Best,

Andrea

On Fri, Apr 15, 2016 at 9:23 PM, Andrea Gazzarini <gxs...@gmail.com> wrote:

> Hi Furkan,
> many thanks, I'm going to try and I'll let you know.
>
> For the first question, I'm not sure about the overall size, but we're
> talking about 2 million (and growing) pages; in general, nothing that
> can be easily handled with a from-scratch, custom solution.
>
> I was wondering if, from a functional perspective, Nutch is a good fit
> for "automating" the periodic indexing (in Solr, which is my ultimate
> goal) of that website. If that works, the same mechanism will be used
> for other websites as well.
>
> Best,
> Andrea
> On 15 Apr 2016 18:16, "Furkan KAMACI" <furkankam...@gmail.com> wrote:
>
>> Hi Andrea,
>>
>> The regex URL filter works like this:
>>
>> This rule accepts anything else (a catch-all):
>>
>> +.
>>
>> Let's assume that you want to crawl Nutch's website. If you wish to
>> limit the crawl to the nutch.apache.org domain, then the definition
>> should be:
>>
>> +^http://([a-z0-9]*\.)*nutch.apache.org/
>>
>> So, if your "more like this" section has URLs with this pattern:
>>
>> http://www.xyz.com/book/{book_id}
>>
>> then your definition should be:
>>
>> +^http://www.xyz.com/book/([0-9]*\.)*
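>>
>> A minimal regex-urlfilter.txt along these lines might be (just a sketch;
>> as far as I know the first matching rule wins, so the catch-all reject
>> must come last):
>>
>> # accept the book detail pages
>> +^http://www.xyz.com/book/([0-9]*\.)*
>> # reject everything else
>> -.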
>>
>> For your first question, you should tell us the approximate size of the
>> data you will crawl, whether you have any other needs, and so on.
>>
>> Kind Regards,
>> Furkan KAMACI
>>
>>
>> On Fri, Apr 15, 2016 at 4:17 PM, Andrea Gazzarini <gxs...@gmail.com>
>> wrote:
>>
>> > Hi guys,
>> > just playing around as a Nutch newbie with a simple (I think) use case:
>> >
>> > I have a website (e.g. http://www.xyz.com) that allows searching for
>> > books. Here, as in any typical search website, I have two kinds of pages:
>> >
>> >  * a page that shows search results (depending on the user-entered
>> >    search terms)
>> >  * a details page about a given book. Each details page is a permalink
>> >    which follows a given naming convention (e.g.
>> >    http://www.xyz.com/book/{book id})
>> >
>> > The details page has something like a "more like this" section that
>> > contains permalinks to other (similar) books.
>> > Now, my requirement is to index in Solr *all* the details pages of that
>> > website.
>> >
>> > If Nutch is a suitable tool for doing that (and this is actually the
>> > first question), could you please give me some hints about how to
>> > configure it?
>> >
>> > Specifically, I tried a seed file with just one entry:
>> >
>> > http://www.xyx.com/book/1
>> >
>> > and then I configured my regex-urlfilter.txt
>> >
>> > +^http://www.xyx.com/book
>> >
>> > But it indexes only the /1 page. I imagined that the "more like this"
>> > section of the /1 page would act as a set of outlinks leading to
>> > further details pages (which in turn have further MLT sections, and so
>> > on).
>> >
>> > Best,
>> > Andrea
>> >
>> >
>>
>
