Re: AW: Crawling with nutch, check Links

Sebastian Nagel Tue, 01 Aug 2017 07:58:00 -0700

> as the ticket is more than 2 years old, I assume it won't be fixed.. :-(


Not necessarily. Other features got in after more than two years ;)

On 07/31/2017 07:32 AM, d.ku...@technisat.de wrote:
> Hey Sebastian,
> 
> 
> thanks. What I have done so far is delete the database and start a whole new 
> crawl. 
> I had seen that Jira ticket about orphaned pages before. That is exactly what 
> I'm looking for, but as the ticket is more than 2 years old, I assume it won't 
> be fixed.. :-(
> 
> Thanks
> 
> David
> 
> 
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com] 
> Sent: Friday, 28 July 2017 12:09
> To: user@nutch.apache.org
> Subject: Re: Crawling with nutch, check Links
> 
> Hi David,
> 
> the easiest way is to delete the CrawlDb and to start the crawl from scratch.
> Since it's a site crawl this should be possible, at least, from time to time.
> Then delete documents from the index which haven't been updated.
> 
> A more sophisticated solution is not yet ready, see
>   https://issues.apache.org/jira/browse/NUTCH-1932
> 
> Best,
> Sebastian
> 
> On 07/27/2017 10:11 AM, d.ku...@technisat.de wrote:
>> Hey,
>>
>> currently I'm working on nutch with solr for our company pages.
>>
>> Assuming the following situation:
>> We have a website:
>>
>> www.mysite.lol<http://www.mysite.lol>
>>
>> at this site there is a link:
>> www.mysite.lol/tespage/3512-1564/<http://www.mysite.lol/tespage/3512-1564/>
>>
>> As you can see, there is a typo; it should be /testpage/:
>>
>> www.mysite.lol/testpage/3512-1564/<http://www.mysite.lol/testpage/3512-1564/>
>>
>> As our framework doesn't care about the text before the ID, we could type 
>> anything we want and the page would still be displayed because of the ID. That 
>> is why both links work and neither returns a 404.
>> If I change the link on the main page to the correct one, let Nutch crawl 
>> the site again, and send it to Solr, the old one is still found.
>>
>> So the link
>> www.mysite.lol/tespage/3512-1564/<http://www.mysite.lol/tespage/3512-1564/>
>> is still in the Nutch db, because the link is valid --> no 404. 
>> But the main page no longer points to this page. How do I tell Nutch to 
>> ignore pages that have no links pointing to them?
>> Basically --> revalidating links and removing pages without links to them?
>>
>>
>>
>> Best regards
>> David Kumar
>>
>> Senior Software Engineer Java, B. Sc.
>> Project Manager PIM
>> Infotech Department
>> TechniSat Digital GmbH
>> Julius-Saxler-Straße 3
>> TechniPark
>> D-54550 Daun / Germany
>>
>> Tel.: + 49 (0) 6592 / 712 -2826
>> Fax: + 49 (0) 6592 / 712 -2829
>>
>> www.technisat.com/de_DE/<http://www.technisat.com/de_DE/>
>> www.facebook.com/technisat
>>
>>
>
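
The clean-up step suggested in the thread (after a fresh crawl, delete index documents that were not updated) could be sketched roughly as below. This is only an illustration under assumptions: the function name `stale_docs_delete_payload` is hypothetical, and the field name `tstamp` is assumed to hold the time at which Nutch last indexed a document; check your own Solr schema before relying on it. The sketch only builds the JSON body for a Solr delete-by-query and does not contact any server.

```python
import json
from datetime import datetime, timezone


def stale_docs_delete_payload(crawl_start: datetime, field: str = "tstamp") -> str:
    """Build a Solr JSON delete-by-query body for documents older than crawl_start.

    `field` is an assumption: it must be a date field that is refreshed
    every time the crawler re-indexes a document.
    `crawl_start` should be timezone-aware; naive values are read as local time.
    """
    # Solr date fields use UTC timestamps in ISO 8601 form with a trailing "Z".
    cutoff = crawl_start.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # "[* TO cutoff}" selects everything strictly older than the cutoff,
    # i.e. documents the fresh crawl did not touch.
    query = f"{field}:[* TO {cutoff}}}"
    return json.dumps({"delete": {"query": query}})


# Example: the new crawl started on 28 July 2017 at 12:09 UTC.
payload = stale_docs_delete_payload(datetime(2017, 7, 28, 12, 9, tzinfo=timezone.utc))
print(payload)
```

The resulting body would be sent with content type `application/json` to the update handler of the Solr core in use, followed by a commit. Running the same query as a plain search first is a cheap way to see which documents would be removed.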

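The idea of an "orphaned" page from the original question can be illustrated with a small, self-contained sketch. It is not how Nutch or NUTCH-1932 implements it; it only assumes that you have a map of outlinks per page (the hypothetical `outlinks` argument) and a list of URLs currently known to the crawl database, and it reports every known URL that can no longer be reached from the seed pages.

```python
from collections import deque


def find_orphans(seeds, outlinks, known_urls):
    """Return the known URLs that are not reachable from any seed URL."""
    reachable = set(seeds)
    queue = deque(seeds)
    # Breadth-first walk over the current link graph, starting at the seeds.
    while queue:
        url = queue.popleft()
        for target in outlinks.get(url, ()):
            if target not in reachable:
                reachable.add(target)
                queue.append(target)
    # Anything still known but never visited has no link path leading to it.
    return set(known_urls) - reachable


# Example with the URLs from the question: the main page now links only
# to the corrected /testpage/ URL, but the database still knows /tespage/.
seeds = ["http://www.mysite.lol/"]
outlinks = {
    "http://www.mysite.lol/": ["http://www.mysite.lol/testpage/3512-1564/"],
}
known_urls = [
    "http://www.mysite.lol/",
    "http://www.mysite.lol/testpage/3512-1564/",
    "http://www.mysite.lol/tespage/3512-1564/",
]
orphans = find_orphans(seeds, outlinks, known_urls)
print(sorted(orphans))
```

Here the misspelled `/tespage/` URL is the only orphan: it still answers without a 404, but nothing links to it any more.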