PDF Content Extraction

McGibbney, Lewis John Mon, 24 Jan 2011 06:13:57 -0800

Hello list,

I have been using Nutch 1.2 to crawl the web for a small number of very 
relevant HTML pages and associated URLs pointing to PDF documents. I have then 
been using Luke v 1.0.1 to look inside my index and confirm that the specific 
PDF documents linked from these pages have been indexed. When I search my 
index, each relevant hit returns a hyperlink (amongst other information). I 
would like to implement a content extraction mechanism that also shows relevant 
text from within the indexed PDF documents whenever a user submits a query. 
E.g. if someone were to submit a query relating to a clause within a legal 
document, the content extraction tool would parse the PDF file and provide a 
snippet of the relevant text from within the PDF document in the search result.
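
To make the intended behaviour concrete, here is a minimal sketch of the 
snippet step only. It assumes the PDF text has already been extracted to a 
plain string (for example by a parser such as Tika) and simply returns the 
text surrounding the first match of a query term. The class name, method name 
and radius parameter are hypothetical, chosen for illustration; none of this 
is Nutch or Tika API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SnippetSketch {

    // Returns up to `radius` characters of context on each side of the first
    // case-insensitive occurrence of `term` in `text`, or "" if there is none.
    // `text` is assumed to be plain text already extracted from the PDF.
    public static String snippet(String text, String term, int radius) {
        // Pattern.quote treats the query term literally, not as a regex.
        Pattern pattern = Pattern.compile(Pattern.quote(term), Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        if (!matcher.find()) {
            return "";
        }
        int start = Math.max(0, matcher.start() - radius);
        int end = Math.min(text.length(), matcher.end() + radius);
        // Mark truncation on either side with an ellipsis.
        String prefix = start > 0 ? "..." : "";
        String suffix = end < text.length() ? "..." : "";
        return prefix + text.substring(start, end) + suffix;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a legal PDF.
        String pdfText = "The tenant accepts liability for damage.";
        System.out.println(snippet(pdfText, "LIABILITY", 8));
    }
}
```

A real search result would more likely need to respect word boundaries and 
highlight every matching term, so this shows only the general shape of the 
feature.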


I hope I have explained my problem properly. I am posting here because I have 
been aware for some time that Tika is possibly the solution, but I am only just 
getting round to working on this now.

Does anyone have a suggestion for how I can implement this in Nutch 1.2? Thank 
you.

Lewis

Glasgow Caledonian University is a registered Scottish charity, number SC021474
