Hi,

From what I can see, the parser is working fine, since a search on the file extension (*.doc) does find the files. The only problem is with the content of the .doc files. So first check whether that content shows up in the Contents which you get either from NutchBean or from search.jsp.
If you are able to get the Contents, then change your search.jsp accordingly; otherwise, check the msword-plugin.xml file for changes.

Regards,
"Ratnesh, V2Solutions India"

Stephen Wilkinson wrote:
>
> If I do a search on *.doc it returns about 7 files. If I do a search on
> something that should be in a Word doc, it doesn't return anything.
>
> Reading the wiki, I haven't got anything in nutch-site.xml; all the
> parse ones are in parse-plugins.xml.
>
> Should I have things in nutch-site.xml and, if so, what is the XML
> syntax for crawling Word docs etc.?
>
> Thanks
>
> Steve
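On the nutch-site.xml question, here is a minimal sketch of what an override might look like. It assumes the 0.8.x defaults, where plugins are switched on through the plugin.includes property and the Word and PDF parsers are named parse-msword and parse-pdf. The rest of the value is what I would expect to find in nutch-default.xml, so please copy the value from your own nutch-default.xml and only add msword and pdf to the parse-(...) group rather than trusting this one as it stands.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: values here override nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumption: everything except "msword|pdf" mirrors the stock default.
         Check your own nutch-default.xml before using this value. -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
</configuration>
```

Since parsing happens while the segments are built, documents that were fetched before the change will most likely have to be re-crawled (or at least re-parsed) and re-indexed before their text becomes searchable.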
--
View this message in context: http://www.nabble.com/having-problems-with-search-reading-word-docs-and-pdf%27s-in-0.8.1-tf3607482.html#a10092090
Sent from the Nutch - User mailing list archive at Nabble.com.
