Hi Tejas,

So I started fresh. I deleted the webpage keyspace, as I am using Cassandra as a backend, but I got the same output: after I run readdb -dump I get a bunch of URLs, just not the ones I want. I get only one fetched site and many parsed links (to be fetched in the next cycle?). Maybe it has something to do with the URLs I am trying to get? I am trying to fetch this URL and similar ones:
http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1

But I have noticed that the links pointing to the next pages look like this:

<a class="resultado_roda" href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>

So I decided to try commenting out this URL rule:

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
#-.*(/[^/]+)/[^/]+\1/[^/]+\1/

But I got the same results: a single site fetched, some URLs parsed, but not the ones I want to match with regex-urlfilter.txt. Any ideas? Thanks a ton for your help, Tejas!

Renato M.

2013/5/12 Tejas Patil <[email protected]>:
> Hi Renato,
>
> That's weird. I ran a crawl over similar URLs having a query at the end
> (http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with 2.x.
> My guess is that something goes wrong during parsing, due to which
> outlinks are not getting into the crawldb.
>
> Start fresh. Clear everything from previous attempts (including the
> backend table named by the value of 'storage.schema.webpage').
> Run these:
> bin/nutch inject <urldir>
> bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
> bin/nutch fetch <batchID> -threads 2
> bin/nutch parse <batchID>
> bin/nutch updatedb
> bin/nutch readdb -dump <output dir>
>
> The readdb output will show whether the outlinks were extracted correctly.
>
> The commands for checking urlfilter rules accept one input URL at a time
> from the console (you need to type/paste the URL and hit enter).
> They show "+" if the URL is accepted by the current rules ("-" for
> rejection).
>
> Thanks,
> Tejas
>
> On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
> [email protected]> wrote:
>
>> And I did try the commands you told me, but I am not sure how they
>> work. They do wait for a URL to be input, but then they print the URL
>> with a '+' at the beginning. What does that mean?
>>
>> http://www.xyz.com/lanchon
>> +http://www.xyz.com/lanchon
>>
>> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
>> > Hi Tejas,
>> >
>> > Thanks for your help. I have tried the expression you suggested, and
>> > now my url-filter file looks like this:
>> >
>> > +http://www.xyz.com/\?page=*
>> >
>> > # skip URLs containing certain characters as probable queries, etc.
>> > #-[?*!@=]
>> > +.
>> >
>> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>> > +.
>> >
>> > # accept anything else
>> > +.
>> >
>> > So after this, I ran a generate command with -topN 5 -depth 5, and then a
>> > fetch all, but I keep getting a single page fetched. What am I
>> > doing wrong? Thanks again for your help.
>> >
>> > Renato M.
>> >
>> > 2013/5/12 Tejas Patil <[email protected]>:
>> >> FYI: You can use either of these commands to run the regex-urlfilter
>> >> rules against any given URL:
>> >>
>> >> bin/nutch plugin urlfilter-regex
>> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >> OR
>> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
>> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>> >>
>> >> Both of them accept one input URL at a time from stdin.
>> >> The latter one has a param which lets you test a given URL against
>> >> several URL filters at once. See its usage for more details.
>> >>
>> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:
>> >>
>> >>> If there is no restriction on the number at the end of the URL, you
>> >>> might just use this:
>> >>> (note that the rule must be above the one which filters URLs with a "?"
>> >>> character)
>> >>>
>> >>> +http://www.xyz.com/\?page=*
>> >>>
>> >>> # skip URLs containing certain characters as probable queries, etc.
>> >>> -[?*!@=]
>> >>>
>> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>> >>> [email protected]> wrote:
>> >>>
>> >>>> Hi all,
>> >>>>
>> >>>> I have been trying to fetch a query similar to:
>> >>>>
>> >>>> http://www.xyz.com/?page=1
>> >>>>
>> >>>> but where the number can vary from 1 to 100. Inside the first page
>> >>>> there are links to the next ones. So I updated the
>> >>>> conf/regex-urlfilter file and added:
>> >>>>
>> >>>> ^[0-9]{1,45}$
>> >>>>
>> >>>> When I do this, the generate job fails saying "Invalid
>> >>>> first character". I have tried generating with topN 5 and depth 5,
>> >>>> trying to fetch more URLs, but that does not work.
>> >>>>
>> >>>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
>> >>>> Thanks in advance!
>> >>>>
>> >>>> Renato M.
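A quick way to see the rule-ordering point from this thread outside of Nutch: regex-urlfilter checks rules top-down and the first matching rule decides. The sketch below is a toy, first-match-wins evaluator in Python (Nutch's RegexURLFilter itself is Java and loads the full default rule file; the two rules here are only illustrative), showing why an accept rule for the paginated URLs has to sit above the `-[?*!@=]` rule:

```python
import re

def filter_url(rules, url):
    """Toy model of Nutch's regex-urlfilter: rules are checked top-down
    and the first pattern that matches anywhere in the URL decides."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched: the URL is rejected

# Default-style rule order: the query-character rule comes first.
default_rules = [
    ("-", r"[?*!@=]"),  # skip URLs containing probable query characters
    ("+", r"."),        # accept anything else
]
url = "http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2"
print(filter_url(default_rules, url))  # False: the '?' rule rejects it first

# Putting an accept rule for the paginated URLs *above* that rule changes the outcome.
fixed_rules = [("+", r"\?pagina=[0-9]+")] + default_rules
print(filter_url(fixed_rules, url))    # True
```

This also suggests why commenting out only the repeated-segment rule changed nothing: with the `-[?*!@=]` line still active, any `?pagina=` URL is rejected before a later `+` rule is even consulted.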

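Two regex details from this thread are worth spelling out. First, a regex-urlfilter line must begin with '+' or '-'; a bare pattern like `^[0-9]{1,45}$` is the likely cause of the "Invalid first character" error. Second, in a regex a trailing `=*` means "zero or more '=' characters", not a shell-style wildcard, so `\?page=*` also matches URLs with no '=' at all; something like `\?page=[0-9]+` is closer to the intent. A small demonstration with Python's re (Nutch uses Java regex, but both engines behave the same on these points):

```python
import re

# In a regex, '=*' means "zero or more '=' signs", not a glob wildcard.
pattern = r"\?page=*"
print(bool(re.search(pattern, "http://www.xyz.com/?page=1")))    # True, as intended
print(bool(re.search(pattern, "http://www.xyz.com/?pagetypo")))  # also True: '=*' matches zero '='

# An explicit digit run expresses "page number" directly:
strict = r"\?page=[0-9]+"
print(bool(re.search(strict, "http://www.xyz.com/?page=1")))     # True
print(bool(re.search(strict, "http://www.xyz.com/?pagetypo")))   # False
```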
