Yes, you can use Nutch to crawl PDFs and extract their text content with
the parse-tika parser plugin. However, that plugin cannot identify
specific fields such as persons, locations or dates, because it knows
nothing about the structure of the file's content. You can write your own
parse plugin that extracts this data and writes it to the search engine
as separate fields, so that it can be searched.
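
To give an idea of the extraction step such a plugin would perform, here
is a rough sketch in plain Java. It is not Nutch API: the class name
EntitySketch, the field names "date" and "money", and the two text
formats it recognizes are all assumptions made for illustration. It only
covers dates and money amounts, since those can be approximated with
regular expressions; people and locations would need a real named-entity
recognizer.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: pulls simple entities out of
// plain text that a parser such as parse-tika has already produced.
public class EntitySketch {

    // ISO-style dates such as 2014-01-01 (assumed input format).
    private static final Pattern DATE =
        Pattern.compile("\\b\\d{4}-\\d{2}-\\d{2}\\b");

    // Dollar amounts such as $2,500 or $19.99 (assumed input format).
    private static final Pattern MONEY =
        Pattern.compile("\\$\\d+(?:,\\d{3})*(?:\\.\\d{2})?");

    // Collect every match of the pattern, in order of appearance.
    private static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = pattern.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // Map each (made-up) field name to the values found in the text.
    public static Map<String, List<String>> extract(String text) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("date", findAll(DATE, text));
        fields.put("money", findAll(MONEY, text));
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(extract("Paid $2,500 on 2014-01-01."));
    }
}
```

In a real plugin, the resulting map would be copied into the parse
metadata and then into index fields, so that the search engine can run
range or exact-match queries on them.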




On Wed, Jan 1, 2014 at 9:12 PM, Philippe de Rochambeau <[email protected]> wrote:

> Hello,
>
> can you use Nutch to crawl PDFs and extract persons, locations, dates, times
> and money amounts as entities, as opposed to plain text strings?
>
> In GATE mimir-cloud (http://gate.ac.uk/mimir/), you can search for
> {People}, {Location}, {Date}, and {Money} entities (if you have previously
> used the appropriate Processing Resources to index your data sources, in
> GATE Developer 7.1.) For instance, you can run search queries such as:
>
> « JOHN PAUL » IN {People}
> Paris IN {Location},
> {Date normalized>20010101 normalize<20100101}
> {Money > 2000}
> ...
>
> Can you do such things in Nutch?
>
> Many thanks.
>
> Philippe




-- 
Don't Grow Old, Grow Up... :-)
