On Sun, Feb 2, 2014 at 5:54 PM, Bayu Widyasanyata <bwidyasany...@gmail.com> wrote:
> Hi Tejas,
>
> It works, and it's great! :)
> After reconfiguring and running generate, fetch, parse & update many
> times, the pages on the 2nd level are now being crawled.
>
> One question: is it fine and correct if I modify my current
> crawler+indexing script into this pseudo-code skeleton?
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> # example number of levels / depth (loop)
> LOOP=4
>
> nutch->inject()
>
> loop[ =< $LOOP]
> {
>     nutch->generate()
>     nutch->fetch(a_segment)
>     nutch->parse(a_segment)
>     nutch->updatedb(a_segment)
> }
>
> nutch->solrindex()

I don't think that this should be a problem. Remember to pass all the
segments generated in the crawl loop to the solrindex job using the
"-dir" option. (A minimal shell sketch of such a loop is included at the
end of this message.)

> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> Thank you!
>
>
> On Mon, Jan 27, 2014 at 3:46 AM, Bayu Widyasanyata
> <bwidyasany...@gmail.com> wrote:
>
> > OK, I will apply it first and update with the result.
> >
> > Thanks.-
> >
> >
> > On Sun, Jan 26, 2014 at 11:01 PM, Tejas Patil
> > <tejas.patil...@gmail.com> wrote:
> >
> >> Please copy this at the end (but above the end tag '</configuration>')
> >> in your $NUTCH/conf/nutch-site.xml:
> >>
> >> <property>
> >>   <name>http.content.limit</name>
> >>   <value>999999999</value>
> >> </property>
> >>
> >> <property>
> >>   <name>http.timeout</name>
> >>   <value>2147483640</value>
> >> </property>
> >>
> >> <property>
> >>   <name>db.max.outlinks.per.page</name>
> >>   <value>999999999</value>
> >> </property>
> >>
> >> Please check whether the url got fetched correctly after every round.
> >> For the first round with seed http://bappenas.go.id, after the
> >> "updatedb" job, run these to check whether the urls are in the crawldb.
> >> The first url must be db_fetched while the second one must be
> >> db_unfetched:
> >>
> >> bin/nutch readdb <YOUR_CRAWLDB> -url http://bappenas.go.id/
> >> bin/nutch readdb <YOUR_CRAWLDB> -url
> >> http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
> >>
> >> Now crawl for the next depth. After the "updatedb" job, check whether
> >> the second url got fetched using the same command again, i.e.:
> >>
> >> bin/nutch readdb <YOUR_CRAWLDB> -url
> >> http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
> >>
> >> Note that if there was any redirection, you need to look for the target
> >> url in the redirection chain and use that url for further debugging.
> >> Verify that the content you got for that url contains the text
> >> "Liberal Party" in the parsed output using this command:
> >>
> >> bin/nutch readseg -get <LATEST_SEGMENT>
> >> http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
> >>
> >> For larger segments, you might get an OOM error. In that case, take the
> >> entire segment dump instead:
> >> bin/nutch readseg -dump <LATEST_SEGMENT> <OUTPUT>
> >>
> >> After all this is verified and everything looks good from the crawling
> >> side, run solrindex and check if you get the query results. If not,
> >> then there was a problem while indexing.
> >>
> >> Thanks,
> >> Tejas
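As an aside, a small shell wrapper can make that per-round readdb check
quicker to repeat. This is only an illustrative sketch (check_status is
not a Nutch command, and the exact readdb output format may vary between
versions):

check_status() {
  # $1 = path to the crawldb, $2 = url to inspect
  # Prints the CrawlDb status line, e.g. "Status: 2 (db_fetched)".
  $NUTCH/bin/nutch readdb "$1" -url "$2" | grep -i "Status"
}

# Example:
# check_status $NUTCH/BappenasCrawl/crawldb http://bappenas.go.id/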
> >>
> >>
> >> On Sun, Jan 26, 2014 at 9:09 AM, Bayu Widyasanyata
> >> <bwidyasany...@gmail.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > I just realized that my nutch didn't crawl the articles/pages
> >> > (depth 2) that are shown on the front page.
> >> > My target URL is: http://bappenas.go.id
> >> >
> >> > As shown on that front page (top right, below the slider banners)
> >> > there is a text link:
> >> >
> >> > "Kerjasama Pembangunan Indonesia-Australia Setelah PM Tony Abbot"
> >> > and its URL:
> >> > http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/?&kid=1390691937
> >> >
> >> > I tried to search with the keyword "Liberal Party" (with quotes),
> >> > which appears on the page above, but got no result :(
> >> >
> >> > Following is the search link queried:
> >> > http://bappenas.go.id/index.php/bappenas_search/result?q=%22Liberal+Party%22
> >> >
> >> > I use the individual script below to crawl:
> >> >
> >> > ===
> >> > # Define env variables
> >> > export JAVA_HOME="/opt/searchengine/jdk1.7.0_45"
> >> > export PATH="$JAVA_HOME/bin:$PATH"
> >> > NUTCH="/opt/searchengine/nutch"
> >> >
> >> > # Start by injecting the seed url(s) into the nutch crawldb:
> >> > $NUTCH/bin/nutch inject $NUTCH/BappenasCrawl/crawldb $NUTCH/urls/seed.txt
> >> >
> >> > # Generate fetch list
> >> > $NUTCH/bin/nutch generate $NUTCH/BappenasCrawl/crawldb $NUTCH/BappenasCrawl/segments
> >> >
> >> > # last segment
> >> > export SEGMENT=$NUTCH/BappenasCrawl/segments/`ls -tr $NUTCH/BappenasCrawl/segments | tail -1`
> >> >
> >> > # Launch the crawler!
> >> > $NUTCH/bin/nutch fetch $SEGMENT -noParsing
> >> >
> >> > # Parse the fetched content:
> >> > $NUTCH/bin/nutch parse $SEGMENT
> >> >
> >> > # Update the crawl database so that for all future crawls, Nutch only
> >> > # checks the already crawled pages, and only fetches new and changed pages.
> >> > $NUTCH/bin/nutch updatedb $NUTCH/BappenasCrawl/crawldb $SEGMENT -filter -normalize
> >> >
> >> > # Index our crawl DB with Solr
> >> > $NUTCH/bin/nutch solrindex http://localhost:8080/solr/bappenasgoid/ $NUTCH/BappenasCrawl/crawldb -dir $NUTCH/BappenasCrawl/segments
> >> > ===
> >> >
> >> > I run this script daily, but it looks like it never reaches the
> >> > individual article pages shown on the front page.
> >> >
> >> > From what Tejas explained on another thread (quoted below), should I
> >> > loop (generate -> fetch -> parse -> update) two or three times to
> >> > crawl 2 or 3 depth levels?
> >> >
> >> > QUOTES from Tejas' e-mail (subject: Questions/issues with nutch):
> >> > *****************
> >> > On Sat, Jun 29, 2013 at 2:49 PM, Tejas Patil
> >> > <tejas.patil...@gmail.com> wrote:
> >> > Yes. Nutch would parse the HTML and extract the content out of it.
> >> > Tweaking around the code surrounding the parser would have made that
> >> > happen. If you did something else, would you mind sharing it?
> >> >
> >> > The "depth" is used by the Crawl class in 1.x, which is deprecated in
> >> > 2.x. Use bin/crawl instead.
> >> > While running the "bin/crawl" script, the "<numberOfRounds>" option is
> >> > nothing but the depth till which you want the crawling to be
> >> > performed.
> >> >
> >> > If you want to use the individual commands instead, run generate ->
> >> > fetch -> parse -> update multiple times. The crawl script internally
> >> > does the same thing.
> >> > e.g. if you want to fetch till depth 3, this is how you could do it:
> >> > inject -> (generate -> fetch -> parse -> update)
> >> >        -> (generate -> fetch -> parse -> update)
> >> >        -> (generate -> fetch -> parse -> update)
> >> >        -> solrindex
> >> > *****************
> >> >
> >> > I have also commented out the line below in the regex-urlfilter.txt
> >> > file:
> >> > # skip URLs containing certain characters as probable queries, etc.
> >> > #-[?*!@=]
> >> >
> >> > Apps: nutch 1.7 and Solr 4.5.1
> >> >
> >> > Thank you so much!
> >> >
> >> > --
> >> > wassalam,
> >> > [bayu]
> >> >
> >>
> >
> >
> > --
> > wassalam,
> > [bayu]
>
>
> --
> wassalam,
> [bayu]
>
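For reference, below is a minimal shell sketch of the crawl loop discussed
above, built only from the individual commands already used in this thread
(inject, generate, fetch, parse, updatedb, solrindex). The paths, the Solr
core URL and the DEPTH value mirror Bayu's script but are assumptions;
adjust them to your own setup. It is a sketch, not a drop-in replacement
for the bin/crawl script.

#!/bin/sh
# Sketch: inject once, run (generate -> fetch -> parse -> updatedb) DEPTH
# times, then index every segment produced by the loop via "-dir".
NUTCH="/opt/searchengine/nutch"                    # assumed Nutch install dir
CRAWL="$NUTCH/BappenasCrawl"                       # assumed crawl dir (crawldb + segments)
SOLR="http://localhost:8080/solr/bappenasgoid/"    # assumed Solr core URL
DEPTH=4                                            # number of rounds, i.e. crawl depth

# Inject the seed url(s) once.
$NUTCH/bin/nutch inject $CRAWL/crawldb $NUTCH/urls/seed.txt

i=1
while [ $i -le $DEPTH ]; do
  # Generate the fetch list for this round.
  $NUTCH/bin/nutch generate $CRAWL/crawldb $CRAWL/segments

  # Pick up the segment that was just created.
  SEGMENT=$CRAWL/segments/`ls -tr $CRAWL/segments | tail -1`

  # Fetch and parse it, then update the crawldb so the next round
  # can generate the newly discovered outlinks.
  $NUTCH/bin/nutch fetch $SEGMENT -noParsing
  $NUTCH/bin/nutch parse $SEGMENT
  $NUTCH/bin/nutch updatedb $CRAWL/crawldb $SEGMENT -filter -normalize

  i=`expr $i + 1`
done

# Index all segments generated by the loop, passing the segments
# directory with "-dir" as suggested above.
$NUTCH/bin/nutch solrindex $SOLR $CRAWL/crawldb -dir $CRAWL/segments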