Hi,
From what I can see, the parser appears to be working fine, since a search
on the file extension (*.doc) returns results; the only problem is with the
content of the .doc files. So try to retrieve the content from the Contents
that you get from either NutchBean or search.jsp.

If you are able to get the Contents, then change your search.jsp accordingly;
otherwise, check the msword-plugin.xml file to see whether it needs changes.
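
On the nutch-site.xml question: a minimal sketch of an override, assuming
the stock plugin.includes property from nutch-default.xml. The value below
is an assumption based on a typical default list, so copy the value from
your own nutch-default.xml and only add msword and pdf to the parse group
(and keep whichever protocol plugin you already crawl with):

```xml
<!-- nutch-site.xml: overrides the plugin.includes value from nutch-default.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Assumed value: the parse group extended with msword and pdf;
         keep the other plugin names as they appear in your nutch-default.xml -->
    <value>protocol-http|urlfilter-regex|parse-(text|html|js|msword|pdf)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
  </property>
</configuration>
```

Documents fetched before such a change would most likely need to be
re-crawled and re-indexed before their content becomes searchable.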


Regards
"Ratnesh,V2Solutions India"


Stephen Wilkinson wrote:
> 
> If I do a search on *.doc, it returns about 7 files. If I do a search on
> something that should be in a Word doc, it doesn't return anything.
> 
> Reading the wiki, I haven't got anything in nutch-site.xml; all the
> parse ones are in parse-plugins.xml.
> 
> Should I have things in nutch-site.xml and, if so, what is the XML
> syntax for crawling Word docs etc.?
>  
> thanks
>  
> Steve

-- 
View this message in context: 
http://www.nabble.com/having-problems-with-search-reading-word-docs-and-pdf%27s-in-0.8.1-tf3607482.html#a10092090
Sent from the Nutch - User mailing list archive at Nabble.com.

