My bad :( After following the tutorial at https://wiki.apache.org/nutch/NutchTutorial, I am now able to get it working.
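For anyone hitting the same problem, the working round from the tutorial looks roughly like this (a sketch, not copied from my exact session; the ftalk-db and urls/ paths are placeholders, and the segment name printed by generate will differ):

```shell
# One complete crawl round, as described on the NutchTutorial page.
# ftalk-db and urls/ are placeholder paths; segment names will differ.
bin/nutch inject ftalk-db/crawldb urls/            # seed the CrawlDB (first round only)
bin/nutch generate ftalk-db/crawldb ftalk-db/segments -topN 500
SEGMENT=$(ls -d ftalk-db/segments/* | tail -1)     # the segment generate just created
bin/nutch fetch "$SEGMENT"
bin/nutch parse "$SEGMENT"
bin/nutch updatedb ftalk-db/crawldb "$SEGMENT"     # this was the step I had been skipping
```

Without updatedb, the CrawlDB never records what was already fetched, so generate keeps selecting the same URLs into each new segment, which explains the duplicates from the earlier mails.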
Previously, I was running the nutch script for generate & fetch without updating the crawl database after each fetch.

I have a small question about pages left un-fetched (status db_unfetched) for transient reasons (loss of internet connectivity). In the crawldb their next fetch time is set to the next day. Is there a way to force Nutch to fetch those pages right now? Also, it tried fetching each such page only once; what is the parameter that sets the number of retries?

Thanks,
Hussain
________________________________________
From: Hussain Pirosha <hussain.piro...@impetus.co.in>
Sent: Wednesday, January 27, 2016 9:54 AM
To: user@nutch.apache.org
Subject: Re: Webpages are fetched multiple times

Hi Markus,

I do see the same URLs in the logs being fetched multiple times. I verified it by reading the sequence file in the content directory with the following piece of code (fs and conf are a Hadoop FileSystem and Configuration set up beforehand):

    import java.io.File;
    import java.io.FileOutputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.protocol.Content;

    String segment = "/home/IMPETUS/hpirosha/softwares/apache-nutch-1.11/runtime/local/bin/ftalk-db/segments/20160125164343";
    File outDir = new File("/home/IMPETUS/hpirosha/softwares/apache-nutch-1.11/runtime/local/bin/crawled-content");
    Path file = new Path(segment, Content.DIR_NAME + "/part-00000/data");
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, file, conf);
    Text key = new Text();
    Content content = new Content();
    while (reader.next(key, content)) {
        // Turn the URL into a flat file name, e.g. www.example.com___page.html
        String filename = key.toString().replaceFirst("http://", "").replaceAll("/", "___").trim();
        File f = new File(outDir.getCanonicalPath() + "/" + filename);
        FileOutputStream fos = new FileOutputStream(f);
        fos.write(content.getContent());
        fos.close();
        System.out.println("URL : " + key.toString());
        System.out.println(f.getAbsolutePath());
    }
    reader.close();

Through this code I can see that a web-page is present multiple times in the downloaded content. Could you please give me some pointers on where I should look in Nutch's code or configuration?
Thanks,
Hussain
________________________________________
From: Markus Jelsma <markus.jel...@openindex.io>
Sent: Monday, January 25, 2016 9:04 PM
To: user@nutch.apache.org
Subject: RE: Webpages are fetched multiple times

Hi - do you see the same URLs written to stdout when fetching? I have seen that too a few times, but in no case was the URL actually downloaded twice, nor did they appear multiple times in the segment or CrawlDB.

Markus

-----Original message-----
> From: Hussain Pirosha <hussain.piro...@impetus.co.in>
> Sent: Monday 25th January 2016 14:30
> To: user@nutch.apache.org
> Subject: Webpages are fetched multiple times
>
> Hello,
>
> I have been experimenting with Apache Nutch version 1.11 for a few days. My
> use case is to crawl a forum in local mode. The seed URL file contains just
> one entry:
>
> http://www.flyertalk.com/forum/united-airlines-mileageplus/1736400-have-simple-question-about-united-airlines-mileageplus-ask-here-2016-a.html
>
> The Nutch config is pasted at http://pasted.co/782e59ad
>
> I issue the following commands:
>
> 1. nutch generate ftalk-db/ ftalk-db/segments/ -depth 5 -topN 500
> 2. nutch fetch ftalk-db/segments/20160125154244/
>
> I am struggling to find out why Nutch keeps fetching the same page multiple
> times. Instead of getting unique web-pages at the end of the crawl, I get a
> lot of duplicates.
>
> Please suggest what I am doing wrong.
>
> Thanks,
>
> Hussain
>
> ________________________________
>
> NOTE: This message may contain information that is confidential, proprietary, privileged or otherwise protected by law. The message is intended solely for the named addressee. If received in error, please destroy and notify the sender. Any use of this email is prohibited when received in error. Impetus does not represent, warrant and/or guarantee, that the integrity of this communication has been maintained nor that the communication is free of errors, virus, interception or interference.
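Regarding the retry and forced-refetch questions earlier in the thread, and hedging since I have not dug through the code: the retry limit for transient fetch errors appears to be the db.fetch.retry.max property (default 3), which can be overridden in conf/nutch-site.xml:

```xml
<!-- conf/nutch-site.xml: allow more retries for transiently failed pages -->
<property>
  <name>db.fetch.retry.max</name>
  <value>5</value>
  <description>Maximum number of times a URL with a transient fetch
  error is retried before it is marked as gone.</description>
</property>
```

And to pick up pages whose next fetch time lies in the future, generate accepts an -adddays option that shifts the clock forward when selecting URLs, e.g. bin/nutch generate ftalk-db/crawldb ftalk-db/segments -topN 500 -adddays 1.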
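Instead of reading the segment's sequence files by hand, as in the Java snippet earlier in the thread, the stock command-line tools can answer both questions: which URLs a segment actually holds, and what status each URL has in the CrawlDB. A sketch, with paths assumed from the earlier commands:

```shell
# Per-status counts (db_fetched, db_unfetched, ...) for the whole CrawlDB
bin/nutch readdb ftalk-db/crawldb -stats

# Dump one segment as plain text to see exactly which URLs it contains
bin/nutch readseg -dump ftalk-db/segments/20160125164343 seg-dump \
    -nocontent -noparse -noparsedata -noparsetext
grep '^Recno' seg-dump/dump | wc -l      # number of records in the segment
```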