Hello list,

I have been using Nutch 1.2 to crawl the web for a small number of highly relevant HTML pages and associated URLs containing PDF documents. I have then been using Luke v1.0.1 to look inside my index and verify that the specific PDF documents residing on these web pages have been indexed. When I search my index, each relevant hit returns a hyperlink (amongst other information).

I intend to implement a content extraction mechanism so that, whenever a user submits a query, the results also show relevant information from within the PDF documents in my index. For example, if someone submitted a query relating to a clause within a legal document, the content extraction tool would parse the PDF file and show a snippet of the relevant text from that PDF in the search result.
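To make the idea more concrete, here is a rough sketch of the kind of snippet I have in mind. It assumes the PDF text has already been extracted to a plain string (for instance by Tika) and only covers cutting a window of text around the first match of the query. The class name SnippetSketch and the method snippet are my own placeholders, not anything from Nutch or Tika.

```java
// Hypothetical helper for illustration only; not part of Nutch or Tika.
class SnippetSketch {

    // Returns the first case-insensitive match of `query` in `text`,
    // padded with up to `radius` characters on each side.
    // Returns an empty string when there is nothing to match.
    static String snippet(String text, String query, int radius) {
        if (text == null || query == null || query.isEmpty() || radius < 0) {
            return "";
        }

        // Scan for the first position where the query matches, ignoring case.
        int hit = -1;
        for (int i = 0; i + query.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, query, 0, query.length())) {
                hit = i;
                break;
            }
        }
        if (hit < 0) {
            return "";
        }

        // Clamp the window to the bounds of the text.
        int start = Math.max(0, hit - radius);
        int end = Math.min(text.length(), hit + query.length() + radius);

        // Mark truncation on either side with an ellipsis.
        StringBuilder sb = new StringBuilder();
        if (start > 0) {
            sb.append("...");
        }
        sb.append(text, start, end);
        if (end < text.length()) {
            sb.append("...");
        }
        return sb.toString();
    }
}
```

A call such as SnippetSketch.snippet(pdfText, "clause 7", 80) would then give the text to show under the hyperlink for that hit, where pdfText stands for whatever string the extraction step produced.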
I hope I have explained my problem properly. I am posting here because I have been aware for some time that Tika might be the solution, but I am only now getting round to working on this. Does anyone have a suggestion for how I can implement this in Nutch 1.2?

Thank you,
Lewis