On Thu, May 16, 2013 at 11:53 AM, Renato Marroquín Mogrovejo <[email protected]> wrote:
> Well I have managed to get the same results as you have (I think). Now
> on my crawldb there are the links with the following structure:
>
> +http://www.xyz.com/\?page=*
>
> But there are also many other links. How can I get only the links in
> the above format? I mean ignoring all the others and only getting the
> ones with the same structure.
>

If you *just* want urls of the form
http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
then add an accept rule for them and reject the rest, using this:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

+http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*

# reject all other urls
-.
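You can sanity-check these rules from stdin before re-crawling. A quick
sketch (the output lines are what I would expect given the rules above,
not copied from an actual run):

$ echo "http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2" \
    | bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
+http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2

$ echo "http://www.telelistas.net/rj/rio+de+janeiro/" \
    | bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter
-http://www.telelistas.net/rj/rio+de+janeiro/

A leading "+" means the url survives the rules; a leading "-" means the
final catch-all rejected it.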
> I have also noticed something interesting: if I use
>
> ./bin/nutch generate -topN 10 -numFetchers 1 -depth 10 -noFilter -adddays 0
>
> I only get the same seed url but no others. Is this caused by the
> depth parameter?
>

Weird. Depth has nothing to do with this. The topN parameter could be set
to a bigger value to see if this still happens. I vaguely remember (2-3
years back) that there was a jira about this, and it was closed as won't
fix because people won't use low topN values in a typical prod setup.

> Thanks again!
>
>
> Renato M.
>
>
> 2013/5/16 Renato Marroquín Mogrovejo <[email protected]>:
> > Hi Tejas,
> >
> > Thank you very much for your help again.
> > But I'm sorry to inform you that I am still not able to get the next link
> > into my crawldb. I am thinking that my conf/regex-urlfilter.txt file
> > is not properly set up. I am sending the content of this file; could
> > you help me determine what is wrong with it?
> > Thanks a ton in advance!
> >
> >
> > Renato M.
> >
> >
> > # skip file: ftp: and mailto: urls
> > -^(file|ftp|mailto):
> >
> > # skip image and other suffixes we can't yet parse
> > # for a more extensive coverage use the urlfilter-suffix plugin
> > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
> >
> > #+http://www.xyz.com/\?page=*
> > +http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=*
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> > +.
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > +.
> >
> > # accept anything else
> > +.
> >
> > 2013/5/13 Tejas Patil <[email protected]>:
> >> Hi Renato,
> >>
> >> The default content limit for the http protocol is 65536, while the
> >> webpage is much bigger than that. The relevant config needs to be
> >> updated. Add this to conf/nutch-site.xml:
> >>
> >> <property>
> >>   <name>http.content.limit</name>
> >>   <value>240000</value>
> >>   <description>The length limit for downloaded content using the http
> >>   protocol, in bytes. If this value is nonnegative (>=0), content longer
> >>   than it will be truncated; otherwise, no truncation at all. Do not
> >>   confuse this setting with the file.content.limit setting.
> >>   </description>
> >> </property>
> >>
> >> I got a connection timed out error after the config change above (it
> >> makes sense, as there is more content to download). So I added this to
> >> conf/nutch-site.xml:
> >>
> >> <property>
> >>   <name>http.timeout</name>
> >>   <value>1000000</value>
> >>   <description>The default network timeout, in milliseconds.</description>
> >> </property>
> >>
> >> After running a fresh crawl, I could see the link to the next page in
> >> the crawldb:
> >>
> >> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> >> key: net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
> >> baseUrl: null
> >> status: 1 (status_unfetched)
> >> fetchTime: 1368424541731
> >> prevFetchTime: 0
> >> fetchInterval: 2592000
> >> retriesSinceFetch: 0
> >> modifiedTime: 0
> >> prevModifiedTime: 0
> >> protocolStatus: (null)
> >> parseStatus: (null)
> >> title: null
> >> score: 0.0042918455
> >> markers: {dist=1}
> >> reprUrl: null
> >> metadata _csh_ : (unprintable bytes)
> >>
> >> HTH
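A side note on picking http.content.limit: it is worth measuring how big
the page actually is first. A rough check, assuming curl is available on
your machine:

$ curl -s 'http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1' | wc -c

If that byte count is above http.content.limit, the tail of the page
(which is often where pagination links sit) gets truncated before
parsing, so the ?pagina= outlinks never reach the crawldb.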
> >> On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo <
> >> [email protected]> wrote:
> >>
> >>> Hi Tejas,
> >>>
> >>> So I started fresh. I deleted the webpage keyspace, as I am using
> >>> Cassandra as a backend, but I got the same output. I mean, I get a
> >>> bunch of urls after I do a readdb -dump, but not the ones I want. I
> >>> get only one fetched site and many links parsed (to be parsed in the
> >>> next cycle?). Maybe it has something to do with the urls I am trying
> >>> to get? I am trying to get this url and similar ones:
> >>>
> >>> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
> >>>
> >>> But I have noticed that the links pointing to the next pages are
> >>> something like this:
> >>>
> >>> <a class="resultado_roda"
> >>> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
> >>>
> >>> So I decided to try commenting out this url rule:
> >>> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >>> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >>>
> >>> But I got the same results: a single site fetched, some urls parsed,
> >>> but not the ones I want using the regex-urlfilter.txt. Any ideas?
> >>> Thanks a ton for your help Tejas!
> >>>
> >>>
> >>> Renato M.
> >>>
> >>>
> >>> 2013/5/12 Tejas Patil <[email protected]>:
> >>> > Hi Renato,
> >>> >
> >>> > That's weird. I ran a crawl over similar urls having a query at the
> >>> > end (http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me
> >>> > with 2.x. My guess is that there is something wrong while parsing,
> >>> > due to which outlinks are not getting into the crawldb.
> >>> >
> >>> > Start fresh. Clear everything from previous attempts (including the
> >>> > backend table named by the value of 'storage.schema.webpage').
> >>> > Run these:
> >>> > bin/nutch inject <urldir>
> >>> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
> >>> > bin/nutch fetch <batchID> -threads 2
> >>> > bin/nutch parse <batchID>
> >>> > bin/nutch updatedb
> >>> > bin/nutch readdb -dump <output dir>
> >>> >
> >>> > The readdb output will show whether the outlinks were extracted
> >>> > correctly.
> >>> >
> >>> > The commands for checking urlfilter rules accept one input url at a
> >>> > time from the console (you need to type/paste the url and hit enter).
> >>> > It shows "+" if the url is accepted by the current rules ("-" for
> >>> > rejection).
> >>> >
> >>> > Thanks,
> >>> > Tejas
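Once the filters behave, that cycle is easy to script. A sketch, with the
caveat that I am assuming a stock 2.x build where fetch and parse accept
-all in place of an explicit batch id (check "bin/nutch fetch" usage on
your build first):

#!/bin/sh
# one Nutch 2.x crawl cycle; 'urls/' is assumed to hold your seed list
bin/nutch inject urls/
bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
bin/nutch fetch -all -threads 2     # -all: fetch every generated batch
bin/nutch parse -all                # parse what was just fetched
bin/nutch updatedb                  # fold outlinks back into the db
bin/nutch readdb -dump dump_out     # inspect dump_out for the outlinks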
> >>> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
> >>> > [email protected]> wrote:
> >>> >
> >>> >> And I did try the commands you told me, but I am not sure how they
> >>> >> work. They do wait for a url to be input, but then it prints the
> >>> >> url with a '+' at the beginning; what does that mean?
> >>> >>
> >>> >> http://www.xyz.com/lanchon
> >>> >> +http://www.xyz.com/lanchon
> >>> >>
> >>> >> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
> >>> >> > Hi Tejas,
> >>> >> >
> >>> >> > Thanks for your help. I have tried the expression you suggested,
> >>> >> > and now my url-filter file is like this:
> >>> >> > +http://www.xyz.com/\?page=*
> >>> >> >
> >>> >> > # skip URLs containing certain characters as probable queries, etc.
> >>> >> > #-[?*!@=]
> >>> >> > +.
> >>> >> >
> >>> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >>> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >>> >> > +.
> >>> >> >
> >>> >> > # accept anything else
> >>> >> > +.
> >>> >> >
> >>> >> > So after this, I run a generate command with -topN 5 -depth 5, and
> >>> >> > then a fetch all, but I keep on getting a single page fetched.
> >>> >> > What am I doing wrong? Thanks again for your help.
> >>> >> >
> >>> >> >
> >>> >> > Renato M.
> >>> >> >
> >>> >> > 2013/5/12 Tejas Patil <[email protected]>:
> >>> >> >> FYI: You can use either of these commands to run the
> >>> >> >> regex-urlfilter rules against any given url:
> >>> >> >>
> >>> >> >> bin/nutch plugin urlfilter-regex
> >>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>> >> >> OR
> >>> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> >>> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>> >> >>
> >>> >> >> Both of them accept input urls one at a time from stdin.
> >>> >> >> The latter one has a param that lets you test a given url against
> >>> >> >> several url filters at once. See its usage for more details.
> >>> >> >>
> >>> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:
> >>> >> >>
> >>> >> >>> If there is no restriction on the number at the end of the url,
> >>> >> >>> you might just use this:
> >>> >> >>> (note that the rule must be above the one which filters urls
> >>> >> >>> with a "?" character)
> >>> >> >>>
> >>> >> >>> +http://www.xyz.com/\?page=*
> >>> >> >>>
> >>> >> >>> # skip URLs containing certain characters as probable queries, etc.
> >>> >> >>> -[?*!@=]
> >>> >> >>>
> >>> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
> >>> >> >>> [email protected]> wrote:
> >>> >> >>>
> >>> >> >>>> Hi all,
> >>> >> >>>>
> >>> >> >>>> I have been trying to fetch urls like:
> >>> >> >>>>
> >>> >> >>>> http://www.xyz.com/?page=1
> >>> >> >>>>
> >>> >> >>>> where the number can vary from 1 to 100. Inside the first page
> >>> >> >>>> there are links to the next ones. So I updated the
> >>> >> >>>> conf/regex-urlfilter file and added:
> >>> >> >>>>
> >>> >> >>>> ^[0-9]{1,45}$
> >>> >> >>>>
> >>> >> >>>> When I do this, the generate job fails, saying "Invalid first
> >>> >> >>>> character". I have tried generating with topN 5 and depth 5 and
> >>> >> >>>> trying to fetch more urls, but that does not work.
> >>> >> >>>>
> >>> >> >>>> Could anyone advise me on how to accomplish this? I am running
> >>> >> >>>> Nutch 2.x. Thanks in advance!
> >>> >> >>>>
> >>> >> >>>>
> >>> >> >>>> Renato M.
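P.S. On the "Invalid first character" error at the bottom of the thread:
every non-comment, non-blank line in regex-urlfilter.txt must start with
'+' (accept) or '-' (reject), so a bare pattern like ^[0-9]{1,45}$ makes
the filter refuse to load the file. For pages 1 to 100, something along
these lines is what the parser expects (untested, and www.xyz.com is just
the placeholder from up-thread):

+^http://www\.xyz\.com/\?page=([1-9][0-9]?|100)$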

