Thank you, Kevinchen, for your tips; I can now parse PDF and Word files.
But in the search results, when I click "cached", the page gives a result
like this:
The cached content has mime type "application/pdf", click this
link <./servlet/cached?idx=0&id=55> to download it directly.
I want the res
You need to turn on two plugins, parse-pdf and parse-msword.
Look at your ${NUTCH_HOME}/conf/nutch-site.xml and change the property
"plugin.includes", for example:
plugin.includes
protocol-(httpclient|file)|urlfilter-(regex)|parse-(text|html|js|pdf|msword)|index-(basic)|query-(basic|
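As a rough illustration of why adding pdf and msword to the parse-(...) group is enough, here is a minimal Java sketch. It assumes the plugin.includes value is a regular expression that must match a whole plugin id. The class and method names are hypothetical, and the expression is cut down to the part that survives intact in the example above (the rest of the original value is not reconstructed).

```java
import java.util.regex.Pattern;

public class PluginIncludesCheck {
    // Only the intact part of the example value; the original continues
    // past "query-(basic|" and is deliberately left out here.
    static final String INCLUDES =
        "protocol-(httpclient|file)|urlfilter-(regex)"
        + "|parse-(text|html|js|pdf|msword)|index-(basic)";

    // Assumption: a plugin counts as enabled when its whole id matches.
    static boolean isIncluded(String pluginId) {
        return Pattern.matches(INCLUDES, pluginId);
    }

    public static void main(String[] args) {
        // parse-rtf is here only as an id that the expression does not list.
        for (String id : new String[] {"parse-pdf", "parse-msword", "parse-rtf"}) {
            System.out.println(id + " -> " + isIncluded(id));
        }
    }
}
```

With this expression, parse-pdf and parse-msword match and parse-rtf does not, so a parser missing from the alternation would simply not be picked up.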
Hi everybody,
I set up nutch-0.9, and I can search txt and html files on my local system. Now I
want to search PDF and MS Word files. Can you tell me how to do that?
BR,
mingkong
I meant that you could just do a crawl of
http://external_url.com/y/z/ . But yes, if you have pages from someone else's
server locally, you will need to rewrite the BASE component of the URL in the
search results.
For that you could probably just hack search.jsp (but don't tell
anyone I told you
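
The BASE rewrite mentioned above could look something like the following Java sketch. This is not code from search.jsp. The helper name and the local mirror path are made up for illustration, and it assumes the locally crawled URLs all share one prefix that can be swapped for the external one.

```java
public class BaseRewriter {
    // Hypothetical helper: replaces the local base prefix of a hit URL
    // with the external base, and leaves other URLs untouched.
    static String rewriteBase(String url, String localBase, String externalBase) {
        if (url.startsWith(localBase)) {
            return externalBase + url.substring(localBase.length());
        }
        return url;
    }

    public static void main(String[] args) {
        String local = "file:///data/mirror/";            // assumed local copy location
        String external = "http://external_url.com/y/z/"; // URL from the message above
        // prints http://external_url.com/y/z/docs/a.html
        System.out.println(rewriteBase("file:///data/mirror/docs/a.html", local, external));
    }
}
```

Something like this would be applied to each hit URL just before it is written into the results page.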
Hello. I need to get the links followed by Nutch to reach a page: something
like the anchors, but with all the information inside the link instead of
just the text of the link.
I don't know if this can be done by building a plugin, or if I must modify the
Nutch code to get this information. I went thro