RE: timestamp in nutch schema

Markus Jelsma Sun, 04 Nov 2012 01:23:42 -0800

Hi - the timestamp is just the time at which a page was indexed. It's not very 
useful except for deduplication. If you want to index a publishing date, you 
must first identify the source of that date and extract it from the webpages. 
It's possible to use og:date, other meta tags, or perhaps other sources, but to 
do so you must create a custom parse filter.
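
As a rough illustration of the extraction step, here is a minimal standalone 
sketch that pulls the content of a named meta tag out of raw HTML. It is not 
Nutch plugin code: the class MetaDateExtractor and the method findMetaContent 
are hypothetical names, and a real parse filter would normally work on the 
parsed document rather than on a regular expression.

```java
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: finds <meta> content by key.
final class MetaDateExtractor {
    // Matches one <meta ...> tag and captures its attribute string.
    private static final Pattern META_TAG =
        Pattern.compile("<meta\\b([^>]*)>", Pattern.CASE_INSENSITIVE);
    // Matches attr="value" or attr='value' pairs inside a tag.
    private static final Pattern ATTRIBUTE =
        Pattern.compile("([\\w:-]+)\\s*=\\s*([\"'])(.*?)\\2");

    // Returns the content attribute of the first meta tag whose
    // "property" or "name" attribute equals the given key.
    static Optional<String> findMetaContent(String html, String key) {
        Matcher tag = META_TAG.matcher(html);
        while (tag.find()) {
            String id = null;
            String content = null;
            Matcher attr = ATTRIBUTE.matcher(tag.group(1));
            while (attr.find()) {
                String attrName = attr.group(1).toLowerCase(Locale.ROOT);
                if (attrName.equals("property") || attrName.equals("name")) {
                    id = attr.group(3);
                } else if (attrName.equals("content")) {
                    content = attr.group(3);
                }
            }
            if (key.equalsIgnoreCase(id) && content != null) {
                return Optional.of(content);
            }
        }
        return Optional.empty();
    }
}
```

Calling MetaDateExtractor.findMetaContent(html, "og:date") yields the raw 
string from the page, which still has to be validated and converted to a date 
before it is indexed.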


Meta tags can be indexed without creating a custom parse filter. If you don't 
trust the websites, or need special (re)formatting or checking logic, you need 
to write a parse filter for it.

I've also built a date parsing filter to retrieve dates in various formats from 
free text. Check Jira for a patch for the dateparsefilter. It's an older 
version but still works well.
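
To give an idea of what retrieving dates in various formats from free text can 
look like, here is a small sketch. It is not the code from the Jira patch: the 
class FreeTextDateFinder, the method findDate, and the three supported formats 
are assumptions chosen for illustration only.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.time.format.ResolverStyle;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not the dateparsefilter from Jira.
final class FreeTextDateFinder {
    // Each regex locates a candidate; the paired formatter validates it.
    private static final Map<Pattern, DateTimeFormatter> FORMATS = new LinkedHashMap<>();
    static {
        // e.g. 2012-11-04
        FORMATS.put(Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b"),
            DateTimeFormatter.ISO_LOCAL_DATE);
        // e.g. 4 November 2012
        FORMATS.put(Pattern.compile("\\b\\d{1,2} [A-Z][a-z]+ \\d{4}\\b"),
            strict("d MMMM uuuu"));
        // e.g. November 4, 2012
        FORMATS.put(Pattern.compile("\\b[A-Z][a-z]+ \\d{1,2}, \\d{4}\\b"),
            strict("MMMM d, uuuu"));
    }

    private static DateTimeFormatter strict(String pattern) {
        // STRICT rejects impossible dates such as 31 February.
        return DateTimeFormatter.ofPattern(pattern, Locale.ENGLISH)
            .withResolverStyle(ResolverStyle.STRICT);
    }

    // Tries the formats in declaration order and returns the first
    // candidate that parses as a real calendar date.
    static Optional<LocalDate> findDate(String text) {
        for (Map.Entry<Pattern, DateTimeFormatter> entry : FORMATS.entrySet()) {
            Matcher m = entry.getKey().matcher(text);
            while (m.find()) {
                try {
                    return Optional.of(LocalDate.parse(m.group(), entry.getValue()));
                } catch (DateTimeParseException e) {
                    // Looked like a date but is not valid; keep scanning.
                }
            }
        }
        return Optional.empty();
    }
}
```

Note that the formats are tried in order, so the result is the first match of 
the earliest listed format, not necessarily the first date in the text.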

-----Original message-----
> From: Joe Zhang <smartag...@gmail.com>
> Sent: Sun 04-Nov-2012 05:44
> To: user <user@nutch.apache.org>
> Subject: timestamp in nutch schema
> 
> My understanding is that the timestamp stores crawling time. Is there any
> way to get nutch to parse out the publishing time of webpages and store
> such info in timestamp or some other field?
>
