> > Nutch cannot do this by default, and it is tricky to implement because
> > there may not be one unique referrer per page.
>
> I don't really need a unique referrer. All I want is to tell the requested
> server on which URL the crawler found the link.

You can write a custom scoring filter to track the URL of the source; see the
one in urlmeta for an example. It should be fairly straightforward to do.

> The admin of one site informed me that his logs show a lot of 404 errors
> caused by my search server. The crawler is requesting weird URLs such as
> http://www.domain.com/~tdz/sbd/zabezpieczanie_baz.pdf;O=A when it should be
> requesting http://www.domain.com/~tdz/sbd/zabezpieczanie_baz.pdf, without
> *;O=A*. I searched the linkdb, and it has no information about either the
> good or the bad URL. Without the referrer I can't find which site contains
> the wrong link, or which code is directing the crawler to the wrong URLs.

Feel free to share your findings.

> > Markus Jelsma-2 wrote
> >
> > What you can try is to add the referrer to outlinks when parsing records.
> > This outlink can be added to CrawlDatum's MetaData which you can then
> > later use to set the referrer. To set the referrer you must hack
>
> Can you help me with it a little bit? Can I do it in the Nutch
> configuration? I am not good at Java programming; I'm using Nutch as a
> crawler app only. I was trying to find the exact file/code where I can
> change it (http plugin), but I didn't find any solution.
>
> Regards
> SZ
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/HTTP-REFERER-is-missing-tp3987967p3990533.html
> Sent from the Nutch - User mailing list archive at Nabble.com.

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
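
For anyone following the thread, the idea discussed above (remember where each outlink was found at parse time, then send that URL as the Referer at fetch time) can be sketched outside of Nutch. This is only an illustration in plain Python, not Nutch's ScoringFilter or CrawlDatum API. The function names `record_referrers` and `request_headers` and the `referrers` dictionary are hypothetical, and the sketch assumes that keeping the first page seen is enough, since a page may have several referrers.

```python
from urllib.parse import urljoin


def record_referrers(page_url, hrefs, referrers):
    """Parse step: remember the first page on which each outlink was found.

    `referrers` maps an absolute outlink URL to the URL of the page that
    linked to it. It stands in for per-URL crawl metadata (an assumption
    made for this sketch, not a Nutch data structure).
    """
    for href in hrefs:
        target = urljoin(page_url, href)  # resolve relative links
        # Keep only the first referrer; later pages do not overwrite it.
        referrers.setdefault(target, page_url)
    return referrers


def request_headers(url, referrers):
    """Fetch step: build request headers, adding Referer when one is known."""
    headers = {}
    referrer = referrers.get(url)
    if referrer is not None:
        headers["Referer"] = referrer
    return headers


if __name__ == "__main__":
    referrers = {}
    record_referrers("http://example.org/dir/index.html", ["a.pdf"], referrers)
    print(request_headers("http://example.org/dir/a.pdf", referrers))
```

With a mapping like this, looking up a URL that returns 404 shows which page carried the bad link, which is the debugging information asked for above.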