Re: HTTP REFERER is missing

Julien Nioche Thu, 21 Jun 2012 00:15:10 -0700

> >
> > Nutch cannot do this by default, and it is tricky to implement because
> > there may not be one unique referrer per page.
> >
> I don't really need a unique referrer. All I want is to tell the requested
> server on which URL the crawler found the link.
>


You can write a custom scoring filter to track the URL of the source; see
the one in urlmeta for an example. It should be fairly straightforward to do.
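
To make the idea concrete, here is a minimal sketch of the bookkeeping such a filter would do. It is an assumption-laden stand-in, not Nutch code: the class ReferrerTracker, the methods recordSource and referrerOf, and the metadata key "_referrer_" are all hypothetical names, and plain Java maps take the place of the crawl metadata. A real plugin would implement Nutch's scoring filter interface and write into the per-URL metadata instead. Because a page can have several inlinks, the sketch simply keeps the first source it sees.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ReferrerTracker {
    // Hypothetical metadata key; this is not a key defined by Nutch.
    static final String REFERRER_KEY = "_referrer_";

    /**
     * Stand-in for the step where a scoring filter sees a parsed page
     * together with its outlinks. Stores the source URL in the metadata
     * of every outlink target, keeping the first source recorded.
     */
    static void recordSource(String fromUrl, List<String> outlinks,
                             Map<String, Map<String, String>> metadataByUrl) {
        for (String target : outlinks) {
            Map<String, String> meta =
                metadataByUrl.computeIfAbsent(target, k -> new LinkedHashMap<>());
            // putIfAbsent keeps the earliest source when several pages link here.
            meta.putIfAbsent(REFERRER_KEY, fromUrl);
        }
    }

    /** Returns the recorded source URL for a target, or null if none is known. */
    static String referrerOf(String url, Map<String, Map<String, String>> metadataByUrl) {
        Map<String, String> meta = metadataByUrl.get(url);
        return meta == null ? null : meta.get(REFERRER_KEY);
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> metadata = new LinkedHashMap<>();
        // Example URLs are made up for illustration.
        recordSource("http://www.example.org/listing/",
                     Arrays.asList("http://www.example.org/docs/file.pdf;O=A"),
                     metadata);
        System.out.println(
            referrerOf("http://www.example.org/docs/file.pdf;O=A", metadata));
    }
}
```

With the source URL stored next to each target, a broken URL that shows up in a site's 404 log can be traced back to the page that linked to it.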


>
> The admin of one site informed me that he sees a lot of 404 errors
> in his logs from my search server. The crawler is opening weird URLs like
> http://www.domain.com/~tdz/sbd/zabezpieczanie_baz.pdf;O=A but it should be
> http://www.domain.com/~tdz/sbd/zabezpieczanie_baz.pdf, without *;O=A*. I
> searched the linkdb and it doesn't have any information about either the
> good or the bad URL. Without the referrer I can't find which site has the
> wrong link or the code directing to the wrong URLs.
>

Feel free to share your findings


>
>
>
> Markus Jelsma-2 wrote
> >
> > What you can try is to add the referrer to outlinks when parsing records.
> > This outlink can be added to CrawlDatum's MetaData which you can then
> > later use to set the referrer. To set the referrer you must hack
> Can you help me with it a little bit? Can I do it in the configuration of
> Nutch?
> I am also not good at Java programming. I'm using Nutch as a crawler app
> only. I was trying to find the exact file/code where I can change it (http
> plugin) but I didn't find any solution.
>
>
> Regards
> SZ
>
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/HTTP-REFERER-is-missing-tp3987967p3990533.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
