Hi Renato,
That's odd. I ran a crawl over similar urls with a query string at the end (
http://ngs.ics.uci.edu/blog/?p=338) and it worked fine for me with 2.x.
My guess is that something is going wrong during parsing, so the
outlinks are not making it into the crawldb.
Start fresh: clear everything from previous attempts (including the
backend table named by the value of 'storage.schema.webpage').
Run these:
bin/nutch inject <urldir>
bin/nutch generate -topN 50 -numFetchers 1 -noFilter -adddays 0
bin/nutch fetch <batchID> -threads 2
bin/nutch parse <batchID>
bin/nutch updatedb
bin/nutch readdb -dump <output dir>
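If you want a quick way to eyeball that dump, a small script can count outlink lines per record. This is only a rough sketch: the exact layout of the readdb dump varies between Nutch versions, so the assumptions here (records separated by blank lines, the record's url on the first line, outlinks on lines containing "outlink") may need adjusting for your build:

```python
# Rough helper for eyeballing a readdb dump. The markers below are
# assumptions about the dump layout, not a guaranteed format:
# records separated by blank lines, record key/url on the first line,
# outlinks on lines mentioning "outlink".
def count_outlinks(dump_text):
    counts = {}
    current = None
    for line in dump_text.splitlines():
        line = line.strip()
        if not line:
            current = None              # blank line ends the record
        elif current is None:
            current = line              # first line of a record: its url
            counts[current] = 0
        elif "outlink" in line.lower():
            counts[current] += 1
    return counts

sample = """http://www.xyz.com/?page=1
status: fetched
outlink: http://www.xyz.com/?page=2
outlink: http://www.xyz.com/?page=3

http://www.xyz.com/?page=2
status: fetched
"""
print(count_outlinks(sample))
# → {'http://www.xyz.com/?page=1': 2, 'http://www.xyz.com/?page=2': 0}
```

A record with zero outlinks for a page that clearly contains links would point at a parsing problem.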
The readdb output will show whether the outlinks were extracted correctly.
The commands for checking urlfilter rules accept one input url at a time
from the console (you need to type/paste the url and hit enter).
They print "+" if the url is accepted by the current rules, and "-" if it
is rejected.
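To make the "+"/"-" output concrete, here is a rough sketch (in Python, not Nutch's actual Java code) of how regex-urlfilter applies its rules: rules are tried top-down, the first matching rule wins, and a leading "+" accepts while "-" rejects. The rule set below is a hypothetical regex-urlfilter.txt mirroring the ones discussed in this thread:

```python
import re

# Hypothetical rule set mirroring a regex-urlfilter.txt file:
# each rule is a ('+' or '-') sign plus a regex. First match wins.
RULES = [
    ("+", r"http://www\.xyz\.com/\?page="),  # accept the paginated urls
    ("-", r"[?*!@=]"),                       # skip probable queries
    ("+", r"."),                             # accept anything else
]

def filter_url(url):
    """Return '+url' if accepted, '-url' if rejected (first matching rule wins)."""
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign + url
    return "-" + url  # no rule matched: reject

print(filter_url("http://www.xyz.com/?page=1"))
# → +http://www.xyz.com/?page=1
```

This also shows why rule order matters: if the "+" rule for ?page= urls sat below the "-[?*!@=]" rule, those urls would be rejected by the "?" before their accept rule was ever reached.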
Thanks,
Tejas
On Sun, May 12, 2013 at 10:30 AM, Renato Marroquín Mogrovejo <
[email protected]> wrote:
> And I did try the commands you told me but I am not sure how they
> work. They do wait for a url to be input, but then they print the url
> with a '+' at the beginning; what does that mean?
>
> http://www.xyz.com/lanchon
> +http://www.xyz.com/lanchon
>
> 2013/5/12 Renato Marroquín Mogrovejo <[email protected]>:
> > Hi Tejas,
> >
> > Thanks for your help. I have tried the expression you suggested, and
> > now my url-filter file is like this:
> > +http://www.xyz.com/\?page=*
> >
> > # skip URLs containing certain characters as probable queries, etc.
> > #-[?*!@=]
> > +.
> >
> > # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
> > #-.*(/[^/]+)/[^/]+\1/[^/]+\1/
> > +.
> >
> > # accept anything else
> > +.
> >
> > So after this, I run a generate command -topN 5 -depth 5, and then a
> > fetch all, but I keep on getting a single page fetched. What am I
> > doing wrong? Thanks again for your help.
> >
> >
> > Renato M.
> >
> > 2013/5/12 Tejas Patil <[email protected]>:
> >> FYI: You can use any one of these commands to run the regex-urlfilter
> >> rules against any given url:
> >>
> >> bin/nutch plugin urlfilter-regex
> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >> OR
> >> bin/nutch org.apache.nutch.net.URLFilterChecker -filterName
> >> org.apache.nutch.urlfilter.regex.RegexURLFilter
> >>
> >> Both of them accept one input url at a time from stdin.
> >> The latter one has a param that lets you test a given url against
> >> several url filters at once. See its usage for more details.
> >>
> >>
> >>
> >> On Sun, May 12, 2013 at 2:02 AM, Tejas Patil <[email protected]> wrote:
> >>
> >>> If there is no restriction on the number at the end of the url, you
> >>> might just use this:
> >>> (note that the rule must be above the one which filters urls with a "?"
> >>> character)
> >>>
> >>> +http://www.xyz.com/\?page=*
> >>>
> >>> # skip URLs containing certain characters as probable queries, etc.
> >>> -[?*!@=]
> >>>
> >>>
> >>>
> >>>
> >>> On Sun, May 12, 2013 at 12:40 AM, Renato Marroquín Mogrovejo <
> >>> [email protected]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> I have been trying to fetch a query similar to:
> >>>>
> >>>> http://www.xyz.com/?page=1
> >>>>
> >>>> But where the number can vary from 1 to 100. Inside the first page
> >>>> there are links to the next ones. So I updated the
> >>>> conf/regex-urlfilter file and added:
> >>>>
> >>>> ^[0-9]{1,45}$
> >>>>
> >>>> When I do this, the generate job fails saying that it is "Invalid
> >>>> first character". I have tried generating with topN 5 and depth 5 and
> >>>> trying to fetch more urls but that does not work.
> >>>>
> >>>> Could anyone advise me on how to accomplish this? I am running Nutch 2.x.
> >>>> Thanks in advance!
> >>>>
> >>>>
> >>>> Renato M.
> >>>>
> >>>
> >>>
>