Well, I have managed to get the same results as you have (I think). My
crawldb now contains links with the following structure:

+http://www.xyz.com/\?page=*

But there are also many other links. How can I get only the links in
the above format, i.e. ignore all the others and keep just the ones
with that structure? I have also noticed something interesting: if I
use

./bin/nutch generate -topN 10 -numFetchers 1 -depth 10 -noFilter -adddays 0

I only get the same seed url but no others. Is this caused by the
depth parameter?

Thanks again!

Renato M.

2013/5/16 Renato Marroquín Mogrovejo <[email protected]>:
> Hi Tejas,
>
> Thank you very much for your help again.
> Unfortunately I am still not able to get the next link into my
> crawldb. I suspect my conf/regex-urlfilter.txt file is not properly
> set up. I am sending the content of this file; could you help me
> determine what is wrong with it?
> Thanks a ton in advance!
>
>
> Renato M.
>
>
> # skip file: ftp: and mailto: urls
> -^(file|ftp|mailto):
>
> # skip image and other suffixes we can't yet parse
> # for a more extensive coverage use the urlfilter-suffix plugin
> -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
>
> #+http://www.xyz.com/\?page=*
> +http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*
>
> # skip URLs containing certain characters as probable queries, etc.
> #-[?*!@=]
> +.
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> +.
>
> # accept anything else
> +.
>
> 2013/5/13 Tejas Patil <[email protected]>:
>> Hi Renato,
>>
>> The default content limit for the http protocol is 65536 bytes,
>> while the webpage is much bigger than that. The relevant config
>> needs to be updated.
>> Add this to conf/nutch-site.xml:
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>240000</value>
>>   <description>The length limit for downloaded content using the http
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>>
>> I got a connection timed out error after this config change (which
>> makes sense, as there is more content to download).
>> So I added this to conf/nutch-site.xml:
>>
>> <property>
>>   <name>http.timeout</name>
>>   <value>1000000</value>
>>   <description>The default network timeout, in milliseconds.</description>
>> </property>
>>
>> After running a fresh crawl, I could see the link to the next page
>> in the crawldb:
>>
>> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
>> key: net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
>> baseUrl: null
>> status: 1 (status_unfetched)
>> fetchTime: 1368424541731
>> prevFetchTime: 0
>> fetchInterval: 2592000
>> retriesSinceFetch: 0
>> modifiedTime: 0
>> prevModifiedTime: 0
>> protocolStatus: (null)
>> parseStatus: (null)
>> title: null
>> score: 0.0042918455
>> markers: {dist=1}
>> reprUrl: null
>> metadata _csh_ : ;���
>>
>> HTH
>>
>>
>> On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo <
>> [email protected]> wrote:
>>
>>> Hi Tejas,
>>>
>>> So I started fresh. I deleted the webpage keyspace, as I am using
>>> Cassandra as a backend, but I got the same output: I get a bunch of
>>> urls after I do a readdb -dump, but not the ones I want. I get only
>>> one fetched site and many links parsed (to be parsed in the next
>>> cycle?). Maybe it has something to do with the urls I am trying to
>>> get?
>>> I am trying to get this url and similar ones:
>>>
>>> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
>>>
>>> But I have noticed that the links pointing to the next pages look
>>> like this:
>>>
>>> <a class="resultado_roda"
>>> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
>>>
>>> So I decided to try commenting out this url rule:
>>>
>>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>>
>>> But I got the same results: a single site fetched, some urls
>>> parsed, but not the ones I want using the regex-urlfilter.txt.
>>> Any ideas? Thanks a ton for your help Tejas!
>>>
>>>
>>> Renato M.
>>>
>>>
>>> 2013/5/12 Tejas Patil <[email protected]>:
>>> > Hi Renato,
>>> >
>>> > That's weird. I ran a crawl over similar urls having a query at
>>> > the end (http://ngs.ics.uci.edu/blog/?p=338) and it worked fine
>>> > for me with 2.x. My guess is that there is something wrong while
>>> > parsing, due to which outlinks are not getting into the crawldb.
>>> >
>>> > Start fresh. Clear everything from previous attempts (including
>>> > the backend table named as the value of 'storage.schema.webpage').
>>> > Run these:
>>> >
>>> > bin/nutch inject <urldir>
>>> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
>>> > bin/nutch fetch <batchID> -threads 2
>>> > bin/nutch parse <batchID>
>>> > bin/nutch updatedb
>>> > bin/nutch readdb -dump <output dir>
>>> >
>>> > The readdb output will show whether the outlinks were extracted
>>> > correctly.
>>> >
>>> > The commands for checking urlfilter rules accept one input url at
>>> > a time from the console (you need to type/paste the url and hit
>>> > enter). They show "+" if the url is accepted by the current rules
>>> > ("-" for rejection).
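[Editor's note] The inject/generate/fetch/parse/updatedb cycle listed
above can be sketched as a small shell loop. This is only a sketch:
the `urls/` seed directory and `crawl_dump` output directory are
placeholder names, and the way the batch id is scraped from the
generate output is an assumption that may need adjusting for a given
Nutch 2.x build.

```shell
#!/bin/sh
# One inject, then a few generate/fetch/parse/updatedb rounds.
bin/nutch inject urls/

for round in 1 2 3; do
    # In Nutch 2.x, generate logs the batch id it created;
    # grab the last field of that log line (assumption: the line
    # contains the words "batch id").
    BATCH_ID=$(bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0 \
               | grep "batch id" | awk '{print $NF}')
    bin/nutch fetch "$BATCH_ID" -threads 2
    bin/nutch parse "$BATCH_ID"
    bin/nutch updatedb
done

# Dump the crawldb to inspect whether outlinks were extracted.
bin/nutch readdb -dump crawl_dump
```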
>>> >
>>> > Thanks,
>>> > Tejas
>>> >
>>> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
>>> > [email protected]> wrote:
>>> >
>>> >> And I did try the commands you told me, but I am not sure how
>>> >> they work. They do wait for an url to be input, but then they
>>> >> print the url with a '+' at the beginning. What does that mean?
>>> >>
>>> >> http://www.xyz.com/lanchon
>>> >> +http://www.xyz.com/lanchon
>>> >>
>>> >> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
>>> >> > Hi Tejas,
>>> >> >
>>> >> > Thanks for your help. I have tried the expression you
>>> >> > suggested, and now my url-filter file looks like this:
>>> >> >
>>> >> > +http://www.xyz.com/\?page=*
>>> >> >
>>> >> > # skip URLs containing certain characters as probable queries, etc.
>>> >> > #-[?*!@=]
>>> >> > +.
>>> >> >
>>> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
>>> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>>> >> > +.
>>> >> >
>>> >> > # accept anything else
>>> >> > +.
>>> >> >
>>> >> > So after this, I run a generate command with -topN 5 -depth 5,
>>> >> > and then a fetch all, but I keep getting a single page fetched.
>>> >> > What am I doing wrong? Thanks again for your help.
>>> >> >
>>> >> >
>>> >> > Renato M.
>>> >> >
>>> >> > 2013/5/12 Tejas Patil <[email protected]>:
>>> >> >> FYI: You can use either of these commands to run the
>>> >> >> regex-urlfilter rules against any given url:
>>> >> >>
>>> >> >> bin/nutch plugin urlfilter-regex
>>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> >> >> OR
>>> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
>>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
>>> >> >>
>>> >> >> Both of them accept input urls one at a time from stdin.
>>> >> >> The latter one has a param which enables you to test a given
>>> >> >> url against several url filters at once. See its usage for
>>> >> >> more details.
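[Editor's note] Since both checkers above read urls from stdin, it
should be possible to pipe a whole file of test urls through them
instead of typing one at a time. A sketch (`test-urls.txt` is a
hypothetical file with one url per line):

```shell
# Each input url should come back prefixed with '+' (accepted by the
# current regex-urlfilter rules) or '-' (rejected).
bin/nutch plugin urlfilter-regex \
    org.apache.nutch.urlfilter.regex.RegexURLFilter < test-urls.txt
```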
>>> >> >>
>>> >> >>
>>> >> >>
>>> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <
>>> [email protected]> wrote:
>>> >> >>
>>> >> >>> If there is no restriction on the number at the end of the
>>> >> >>> url, you might just use this (note that the rule must be
>>> >> >>> above the one which filters urls with a "?" character):
>>> >> >>>
>>> >> >>> +http://www.xyz.com/\?page=*
>>> >> >>>
>>> >> >>> # skip URLs containing certain characters as probable queries, etc.
>>> >> >>> -[?*!@=]
>>> >> >>>
>>> >> >>>
>>> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
>>> >> >>> [email protected]> wrote:
>>> >> >>>
>>> >> >>>> Hi all,
>>> >> >>>>
>>> >> >>>> I have been trying to fetch a query similar to:
>>> >> >>>>
>>> >> >>>> http://www.xyz.com/?page=1
>>> >> >>>>
>>> >> >>>> where the number can vary from 1 to 100. Inside the first
>>> >> >>>> page there are links to the next ones. So I updated the
>>> >> >>>> conf/regex-urlfilter file and added:
>>> >> >>>>
>>> >> >>>> ^[0-9]{1,45}$
>>> >> >>>>
>>> >> >>>> When I do this, the generate job fails saying "Invalid
>>> >> >>>> first character". I have tried generating with topN 5 and
>>> >> >>>> depth 5 and trying to fetch more urls, but that does not
>>> >> >>>> work.
>>> >> >>>>
>>> >> >>>> Could anyone advise me on how to accomplish this? I am
>>> >> >>>> running Nutch 2.x. Thanks in advance!
>>> >> >>>>
>>> >> >>>>
>>> >> >>>> Renato M.
>>> >> >>>
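[Editor's note] The behavior this thread keeps running into is that
the regex url filter applies its rules top to bottom and the first
matching rule wins ('+' accepts, '-' rejects). That can be simulated
in a few lines; this is a Python sketch, not Nutch code, and the rule
list and urls are made up for illustration. It also shows why the
leftover `+.` rules in the filter file above accept every url, and why
replacing them with a final reject-all rule restricts the crawl to the
`?page=` urls only:

```python
import re

# Toy model of Nutch's urlfilter-regex plugin: rules are tried top to
# bottom and the FIRST match decides. Nutch uses Java regexes; Python's
# re module is close enough for this sketch.
RULES = [
    ("-", r"^(file|ftp|mailto):"),
    ("+", r"http://www\.xyz\.com/\?page=.*"),  # accept rule BEFORE the query-char filter
    ("-", r"[?*!@=]"),                         # skip probable queries
    ("-", r".*"),                              # reject everything else
]

def filter_url(url):
    """Return True if the url is accepted, False if rejected."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == "+"
    return False  # no rule matched at all

print(filter_url("http://www.xyz.com/?page=2"))  # True  (accept rule fires first)
print(filter_url("http://www.xyz.com/other"))    # False (falls through to reject-all)
```

With a trailing `("+", r".*")` instead of the reject-all rule, every
url that survives the earlier filters is accepted, which matches the
"many other links" Renato sees in his crawldb.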

