On Sun, Feb 2, 2014 at 5:54 PM, Bayu Widyasanyata <bwidyasany...@gmail.com> wrote:
> Hi Tejas,
>
> It works, and it's great! :)
> After reconfiguring and running generate, fetch, parse & update many
> times, the pages on the 2nd level are now being crawled.
>
> One question: is it fine and correct if I modify my current
> crawler+indexing script into this pseudo-code skeleton?
>
> >>>>>>>>>>>>>>>>>>>>>>>>>>>
> # example number of levels / depth (loop)
> LOOP=4
>
> nutch->inject()
>
> loop[ =< $LOOP]
> {
>     nutch->generate()
>     nutch->fetch(a_segment)
>     nutch->parse(a_segment)
>     nutch->updatedb(a_segment)
> }
>
> nutch->solrindex()

I don't think that this should be a problem. Remember to pass all the
segments generated in the crawl loop to the solrindex job using the
"-dir" option. (A minimal shell sketch of such a loop is included at the
end of this message.)

> >>>>>>>>>>>>>>>>>>>>>>>>>>>
>
> Thank you!
>
>
> On Mon, Jan 27, 2014 at 3:46 AM, Bayu Widyasanyata
> <bwidyasany...@gmail.com> wrote:
>
> > OK, I will apply it first and update with the result.
> >
> > Thanks.-
> >
> >
> > On Sun, Jan 26, 2014 at 11:01 PM, Tejas Patil
> > <tejas.patil...@gmail.com> wrote:
> >
> >> Please copy this at the end (but above the end tag '</configuration>')
> >> in your $NUTCH/conf/nutch-site.xml:
> >>
> >> <property>
> >>   <name>http.content.limit</name>
> >>   <value>999999999</value>
> >> </property>
> >>
> >> <property>
> >>   <name>http.timeout</name>
> >>   <value>2147483640</value>
> >> </property>
> >>
> >> <property>
> >>   <name>db.max.outlinks.per.page</name>
> >>   <value>999999999</value>
> >> </property>
> >>
> >> Please check whether the url got fetched correctly after every round.
> >> For the first round with seed http://bappenas.go.id, after the
> >> "updatedb" job, run these to check whether the urls are in the crawldb.
> >> The first url must be db_fetched while the second one must be
> >> db_unfetched:
> >>
> >> bin/nutch readdb <YOUR_CRAWLDB> -url http://bappenas.go.id/
> >> bin/nutch readdb <YOUR_CRAWLDB> -url
> >> http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
> >>
> >> Now crawl for the next depth. After the "updatedb" job, check whether
> >> the second url got fetched using the same command again, i.e.:
> >>
> >> bin/nutch readdb <YOUR_CRAWLDB> -url
> >> http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
> >>
> >> Note that if there was any redirection, you need to look for the target
> >> url in the redirection chain and use that url for further debugging.
> >> Verify that the content you got for that url contains the text
> >> "Liberal Party" in the parsed output using this command:
> >>
> >> bin/nutch readseg -get <LATEST_SEGMENT>
> >> http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/
> >>
> >> For larger segments, you might get an OOM error. In that case, take the
> >> entire segment dump instead:
> >> bin/nutch readseg -dump <LATEST_SEGMENT> <OUTPUT>
> >>
> >> After all this is verified and everything looks good from the crawling
> >> side, run solrindex and check if you get the query results. If not,
> >> then there was a problem while indexing.
> >>
> >> Thanks,
> >> Tejas
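As an aside, a small shell wrapper can make that per-round readdb check
quicker to repeat. This is only an illustrative sketch (check_status is
not a Nutch command, and the exact readdb output format may vary between
versions):

check_status() {
  # $1 = path to the crawldb, $2 = url to inspect
  # Prints the CrawlDb status line, e.g. "Status: 2 (db_fetched)".
  $NUTCH/bin/nutch readdb "$1" -url "$2" | grep -i "Status"
}

# Example:
# check_status $NUTCH/BappenasCrawl/crawldb http://bappenas.go.id/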
> >>
> >>
> >> On Sun, Jan 26, 2014 at 9:09 AM, Bayu Widyasanyata
> >> <bwidyasany...@gmail.com> wrote:
> >>
> >> > Hi,
> >> >
> >> > I just realized that my nutch didn't crawl the articles/pages
> >> > (depth 2) that are shown on the front page.
> >> > My target URL is: http://bappenas.go.id
> >> >
> >> > As shown on that front page (top right, below the slider banners)
> >> > there is a text link:
> >> >
> >> > "Kerjasama Pembangunan Indonesia-Australia Setelah PM Tony Abbot"
> >> > and its URL:
> >> > http://bappenas.go.id/berita-dan-siaran-pers/kerjasama-pembangunan-indonesia-australia-setelah-pm-tony-abbot/?&kid=1390691937
> >> >
> >> > I tried to search with the keyword "Liberal Party" (with quotes),
> >> > which appears on the page above, but got no result :(
> >> >
> >> > Following is the search link queried:
> >> > http://bappenas.go.id/index.php/bappenas_search/result?q=%22Liberal+Party%22
> >> >
> >> > I use the individual script below to crawl:
> >> >
> >> > ===
> >> > # Define env variables
> >> > export JAVA_HOME="/opt/searchengine/jdk1.7.0_45"
> >> > export PATH="$JAVA_HOME/bin:$PATH"
> >> > NUTCH="/opt/searchengine/nutch"
> >> >
> >> > # Start by injecting the seed url(s) into the nutch crawldb:
> >> > $NUTCH/bin/nutch inject $NUTCH/BappenasCrawl/crawldb $NUTCH/urls/seed.txt
> >> >
> >> > # Generate fetch list
> >> > $NUTCH/bin/nutch generate $NUTCH/BappenasCrawl/crawldb $NUTCH/BappenasCrawl/segments
> >> >
> >> > # last segment
> >> > export SEGMENT=$NUTCH/BappenasCrawl/segments/`ls -tr $NUTCH/BappenasCrawl/segments | tail -1`
> >> >
> >> > # Launch the crawler!
> >> > $NUTCH/bin/nutch fetch $SEGMENT -noParsing
> >> >
> >> > # Parse the fetched content:
> >> > $NUTCH/bin/nutch parse $SEGMENT
> >> >
> >> > # Update the crawl database so that for all future crawls, Nutch only
> >> > # checks the already crawled pages, and only fetches new and changed pages.
> >> > $NUTCH/bin/nutch updatedb $NUTCH/BappenasCrawl/crawldb $SEGMENT -filter -normalize
> >> >
> >> > # Index our crawl DB with Solr
> >> > $NUTCH/bin/nutch solrindex http://localhost:8080/solr/bappenasgoid/ $NUTCH/BappenasCrawl/crawldb -dir $NUTCH/BappenasCrawl/segments
> >> > ===
> >> >
> >> > I run this script daily, but it looks like it never reaches the
> >> > individual article pages shown on the front page.
> >> >
> >> > From what Tejas explained on another thread (quoted below), should I
> >> > loop (generate -> fetch -> parse -> update) two or three times to
> >> > crawl 2 or 3 depth levels?
> >> >
> >> > QUOTES from Tejas' e-mail (subject: Questions/issues with nutch):
> >> > *****************
> >> > On Sat, Jun 29, 2013 at 2:49 PM, Tejas Patil
> >> > <tejas.patil...@gmail.com> wrote:
> >> > Yes. Nutch would parse the HTML and extract the content out of it.
> >> > Tweaking around the code surrounding the parser would have made that
> >> > happen. If you did something else, would you mind sharing it?
> >> >
> >> > The "depth" is used by the Crawl class in 1.x, which is deprecated in
> >> > 2.x. Use bin/crawl instead.
> >> > While running the "bin/crawl" script, the "<numberOfRounds>" option is
> >> > nothing but the depth till which you want the crawling to be
> >> > performed.
> >> >
> >> > If you want to use the individual commands instead, run generate ->
> >> > fetch -> parse -> update multiple times. The crawl script internally
> >> > does the same thing.
> >> > e.g. if you want to fetch till depth 3, this is how you could do it:
> >> > inject -> (generate -> fetch -> parse -> update)
> >> >        -> (generate -> fetch -> parse -> update)
> >> >        -> (generate -> fetch -> parse -> update)
> >> >        -> solrindex
> >> > *****************
> >> >
> >> > I have also commented out the line below in the regex-urlfilter.txt
> >> > file:
> >> > # skip URLs containing certain characters as probable queries, etc.
> >> > #-[?*!@=]
> >> >
> >> > Apps: nutch 1.7 and Solr 4.5.1
> >> >
> >> > Thank you so much!
> >> >
> >> > --
> >> > wassalam,
> >> > [bayu]
> >> >
> >>
> >
> >
> > --
> > wassalam,
> > [bayu]
>
>
> --
> wassalam,
> [bayu]
>
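For reference, below is a minimal shell sketch of the crawl loop discussed
above, built only from the individual commands already used in this thread
(inject, generate, fetch, parse, updatedb, solrindex). The paths, the Solr
core URL and the DEPTH value mirror Bayu's script but are assumptions;
adjust them to your own setup. It is a sketch, not a drop-in replacement
for the bin/crawl script.

#!/bin/sh
# Sketch: inject once, run (generate -> fetch -> parse -> updatedb) DEPTH
# times, then index every segment produced by the loop via "-dir".
NUTCH="/opt/searchengine/nutch"                    # assumed Nutch install dir
CRAWL="$NUTCH/BappenasCrawl"                       # assumed crawl dir (crawldb + segments)
SOLR="http://localhost:8080/solr/bappenasgoid/"    # assumed Solr core URL
DEPTH=4                                            # number of rounds, i.e. crawl depth

# Inject the seed url(s) once.
$NUTCH/bin/nutch inject $CRAWL/crawldb $NUTCH/urls/seed.txt

i=1
while [ $i -le $DEPTH ]; do
  # Generate the fetch list for this round.
  $NUTCH/bin/nutch generate $CRAWL/crawldb $CRAWL/segments

  # Pick up the segment that was just created.
  SEGMENT=$CRAWL/segments/`ls -tr $CRAWL/segments | tail -1`

  # Fetch and parse it, then update the crawldb so the next round
  # can generate the newly discovered outlinks.
  $NUTCH/bin/nutch fetch $SEGMENT -noParsing
  $NUTCH/bin/nutch parse $SEGMENT
  $NUTCH/bin/nutch updatedb $CRAWL/crawldb $SEGMENT -filter -normalize

  i=`expr $i + 1`
done

# Index all segments generated by the loop, passing the segments
# directory with "-dir" as suggested above.
$NUTCH/bin/nutch solrindex $SOLR $CRAWL/crawldb -dir $CRAWL/segments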