> as the ticket is more than 2 years old, I assume it won't be fixed.. :-(
Not necessarily. Other features got in after more than two years ;)

On 07/31/2017 07:32 AM, d.ku...@technisat.de wrote:
> Hey Sebastian,
>
> thanks. What I did so far is: delete the database and start a whole new
> crawl.
> I had seen that Jira issue about orphaned pages before. That is exactly what
> I'm looking for: as the ticket is more than 2 years old, I assume it won't be
> fixed.. :-(
>
> Thanks
>
> David
>
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Friday, 28 July 2017 12:09
> To: user@nutch.apache.org
> Subject: Re: Crawling with nutch, check Links
>
> Hi David,
>
> the easiest way is to delete the CrawlDb and to start the crawl from scratch.
> Since it's a site crawl this should be possible, at least, from time to time.
> Then delete documents from the index which haven't been updated.
>
> A more sophisticated solution is not yet ready, see
> https://issues.apache.org/jira/browse/NUTCH-1932
>
> Best,
> Sebastian
>
> On 07/27/2017 10:11 AM, d.ku...@technisat.de wrote:
>> Hey,
>>
>> currently I'm working on Nutch with Solr for our company pages.
>>
>> Assume the following situation:
>> We have a website:
>>
>> www.mysite.lol
>>
>> On this site there is a link:
>> www.mysite.lol/tespage/3512-1564/
>>
>> As you can see, there is a typo; it should be /testpage/:
>>
>> www.mysite.lol/testpage/3512-1564/
>>
>> As our framework doesn't care about the text before the ID, we could type
>> anything we want and the page will be displayed because of the ID. That is
>> why both links are fine and there is no 404.
>> If I change the link on the main page to the correct one, let Nutch crawl
>> the site again, and send it to Solr, the old one is still found.
>>
>> So the link
>> www.mysite.lol/tespage/3512-1564/
>> is still in the Nutch db, because the link is valid --> no 404.
>> But no page links to it any more. How do I tell Nutch to
>> ignore pages which don't have a link pointing to them?
>> Basically --> revalidating links and removing pages without links to them?
>>
>>
>>
>> Kind regards
>> David Kumar
>>
>> Senior Software Engineer Java, B. Sc.
>> Project Manager PIM
>> Infotech Department
>> TechniSat Digital GmbH
>> Julius-Saxler-Straße 3
>> TechniPark
>> D-54550 Daun / Germany
>>
>> Tel.: + 49 (0) 6592 / 712 -2826
>> Fax: + 49 (0) 6592 / 712 -2829
>>
>> www.technisat.com/de_DE/
>> www.facebook.com/technisat
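
For the last step of Sebastian's suggestion in this thread (recrawl from scratch, then delete documents from the index which haven't been updated), a minimal sketch of how the cleanup request could be built. It assumes the Solr documents carry a fetch-time field named `tstamp`, as Nutch's basic indexing plugin normally writes, and that the start time of the fresh crawl was recorded. The function name `stale_docs_delete_payload` is made up for illustration; check the field name against your own schema before deleting anything.

```python
import json
from datetime import datetime, timezone


def stale_docs_delete_payload(crawl_started: datetime, field: str = "tstamp") -> str:
    """Build a Solr delete-by-query JSON body for documents last indexed
    before the fresh crawl started.

    `field` is an assumption: it must be the date field your index uses
    for the fetch/index time of a document.
    """
    # Solr expects UTC timestamps in ISO 8601 form with a trailing 'Z'.
    cutoff = crawl_started.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # '[' keeps the lower bound open-ended and inclusive; the closing '}'
    # makes the upper bound exclusive, so documents stamped exactly at
    # the cutoff are kept.
    query = f"{field}:[* TO {cutoff}}}"
    return json.dumps({"delete": {"query": query}})


if __name__ == "__main__":
    started = datetime(2017, 7, 28, 10, 9, tzinfo=timezone.utc)
    print(stale_docs_delete_payload(started))
```

The resulting JSON body would be POSTed to the core's update handler, followed by a commit. Running the same range query as a normal search first shows which documents would be removed.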