Hi Renato,

The default content limit for the http protocol is 65536 bytes, while the
webpage is much bigger than that, so the relevant config needs to be
updated. Add this to conf/nutch-site.xml:
<property>
  <name>http.content.limit</name>
  <value>240000</value>
  <description>The length limit for downloaded content using the http
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>

After this config change I got a connection timed out error (which makes
sense, as there is now more content to download). So I also added this to
conf/nutch-site.xml:

<property>
  <name>http.timeout</name>
  <value>1000000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

After running a fresh crawl, I could see the link to the next page in the
crawldb:

http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
key: net.telelistas.www:http/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2
baseUrl: null
status: 1 (status_unfetched)
fetchTime: 1368424541731
prevFetchTime: 0
fetchInterval: 2592000
retriesSinceFetch: 0
modifiedTime: 0
prevModifiedTime: 0
protocolStatus: (null)
parseStatus: (null)
title: null
score: 0.0042918455
markers: {dist=1}
reprUrl: null
metadata _csh_ : ;���

HTH

On Sun, May 12, 2013 at 10:21 PM, Renato Marroquín Mogrovejo
<[email protected]> wrote:
> Hi Tejas,
>
> So I started fresh. I deleted the webpage keyspace, as I am using
> Cassandra as a backend, but I got the same output. I mean, I get a
> bunch of urls after I do a readdb -dump, but not the ones I want. I get
> only one fetched site and many links parsed (to be parsed in the next
> cycle?). Maybe it has something to do with the urls I am trying to
> get?
> I am trying to get this url and similar ones:
>
> http://www.telelistas.net/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=1
>
> But I have noticed that the links pointing to the next pages are
> something like this:
>
> <a class="resultado_roda"
> href="/rj/rio+de+janeiro/lanchonetes+restaurantes/?pagina=2">2 </a>
>
> So I decided to try commenting out this url rule:
>
> # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
> But I got the same results: a single site fetched, some urls parsed,
> but not the ones I want using the regex-urlfilter.txt. Any ideas?
> Thanks a ton for your help Tejas!
>
>
> Renato M.
>
>
> 2013/5/12 Tejas Patil <[email protected]>:
> > Hi Renato,
> >
> > That's weird. I ran a crawl over similar urls having a query at the end
> > (http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with 2.x.
> > My guess is that there is something wrong while parsing, due to which
> > outlinks are not getting into the crawldb.
> >
> > Start fresh. Clear everything from previous attempts (including the
> > backend table named as the value of 'storage.schema.webpage').
> > Run these:
> >
> > bin/nutch inject <urldir>
> > bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
> > bin/nutch fetch <batchID> -threads 2
> > bin/nutch parse <batchID>
> > bin/nutch updatedb
> > bin/nutch readdb -dump <output dir>
> >
> > The readdb output will show whether the outlinks were extracted correctly.
> >
> > The commands for checking urlfilter rules accept one input url at a time
> > from the console (you need to type/paste the url and hit enter).
> > It shows "+" if the url is accepted by the current rules ("-" for
> > rejection).
> >
> > Thanks,
> > Tejas
> >
> > On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo
> > <[email protected]> wrote:
> >
> >> And I did try the commands you told me but I am not sure how they
> >> work.
> >> They do wait for an url to be input, but then it prints the url
> >> with a '+' at the beginning; what does that mean?
> >>
> >> http://www.xyz.com/lanchon
> >> +http://www.xyz.com/lanchon
> >>
> >> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
> >> > Hi Tejas,
> >> >
> >> > Thanks for your help. I have tried the expression you suggested, and
> >> > now my url-filter file is like this:
> >> >
> >> > +http://www.xyz.com/\?page=*
> >> >
> >> > # skip URLs containing certain characters as probable queries, etc.
> >> > #-[?*!@=]
> >> > +.
> >> >
> >> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> >> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> >> > +.
> >> >
> >> > # accept anything else
> >> > +.
> >> >
> >> > So after this, I ran a generate command with -topN 5 -depth 5, and
> >> > then a fetch all, but I keep on getting a single page fetched. What
> >> > am I doing wrong? Thanks again for your help.
> >> >
> >> > Renato M.
> >> >
> >> > 2013/5/12 Tejas Patil <[email protected]>:
> >> >> FYI: You can use either of these commands to run the regex-urlfilter
> >> >> rules against any given url:
> >> >>
> >> >> bin/nutch plugin urlfilter-regex
> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> >> OR
> >> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> >> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> >>
> >> >> Both of them accept one input url at a time from stdin.
> >> >> The latter one has a param which lets you test a given url against
> >> >> several url filters at once. See its usage for more details.
> >> >>
> >> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil
> >> >> <[email protected]> wrote:
> >> >>
> >> >>> If there is no restriction on the number at the end of the url, you
> >> >>> might just use this (note that the rule must be above the one which
> >> >>> filters urls with a "?" character):
> >> >>>
> >> >>> +http://www.xyz.com/\?page=*
> >> >>>
> >> >>> # skip URLs containing certain characters as probable queries, etc.
> >> >>> -[?*!@=]
> >> >>>
> >> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo
> >> >>> <[email protected]> wrote:
> >> >>>
> >> >>>> Hi all,
> >> >>>>
> >> >>>> I have been trying to fetch a query similar to:
> >> >>>>
> >> >>>> http://www.xyz.com/?page=1
> >> >>>>
> >> >>>> But where the number can vary from 1 to 100. Inside the first page
> >> >>>> there are links to the next ones. So I updated the
> >> >>>> conf/regex-urlfilter file and added:
> >> >>>>
> >> >>>> ^[0-9]{1,45}$
> >> >>>>
> >> >>>> When I do this, the generate job fails saying that it is "Invalid
> >> >>>> first character". I have tried generating with topN 5 and depth 5
> >> >>>> and trying to fetch more urls but that does not work.
> >> >>>>
> >> >>>> Could anyone advise me on how to accomplish this? I am running
> >> >>>> Nutch 2.x. Thanks in advance!
> >> >>>>
> >> >>>> Renato M.
> >> >>>
> >> >>>
> >> >
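A note on how the rule file discussed above is applied: regex-urlfilter.txt is walked top-down and the first rule whose pattern matches somewhere in the url decides the outcome ('+' accept, '-' reject), which is why the +http://www.xyz.com/\?page= rule has to sit above -[?*!@=]. A rough sketch of that first-match-wins behaviour, using a hypothetical check_url helper (grep -E only approximates the Java regexes Nutch's RegexURLFilter really uses):

```shell
# Sketch of regex-urlfilter.txt evaluation: try rules top-down,
# first pattern found in the url decides ('+' keep, '-' drop).
check_url() {
  url=$1
  # rule 1: keep the paginated urls (must come before the '?' filter)
  if printf '%s' "$url" | grep -Eq 'http://www\.xyz\.com/\?page='; then
    echo "+$url"; return
  fi
  # rule 2: skip urls containing probable query characters
  if printf '%s' "$url" | grep -Eq '[?*!@=]'; then
    echo "-$url"; return
  fi
  # rule 3: accept anything else
  echo "+$url"
}

check_url 'http://www.xyz.com/?page=2'    # accepted by rule 1
check_url 'http://www.xyz.com/foo?bar=1'  # rejected by rule 2
check_url 'http://www.xyz.com/plain'      # accepted by rule 3
```

The '+'/'-' prefixes it prints mirror what the URLFilterChecker commands above show on the console.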


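One more note on the http.content.limit change at the top of this thread: before raising the limit it helps to confirm the page really is larger than the 65536-byte default, because a truncated fetch silently drops outlinks near the end of the page (such as a "next page" link). A small sketch with a hypothetical over_limit helper (the curl line in the comment is only indicative; the helper just does the comparison):

```shell
# Decide whether a page of a given size would be truncated by Nutch's
# http.content.limit (default 65536 bytes). Measure the real size with e.g.
#   curl -s -o /dev/null -w '%{size_download}' '<url>'
over_limit() {
  size=$1
  limit=${2:-65536}   # default http.content.limit
  if [ "$size" -gt "$limit" ]; then
    echo "truncated: raise http.content.limit above $size"
  else
    echo "fits within the limit"
  fi
}

over_limit 240000   # a page bigger than the default limit
over_limit 40000    # a page that fits
```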