[Nutch-general] Nutch and fileparsers.

Gilbert Groenendijk Wed, 07 Feb 2007 01:53:31 -0800

Hi,

I currently have two questions about the file format parsers. First, I would
like to know how the PDF parser handles PDF files. Is it possible to split a
PDF page by page, so that when a match is found on a specific page, the
result can link directly to that page, e.g. with #page=12?

The second question is about content 'filtering'. What happens if I index a
PowerPoint file whose slides all carry the header 'CompanyName Presentation'?
The word 'Presentation' is irrelevant, but the company name is not. Because
the header is on every page, it gives me 'garbage' in the index. Does anyone
have any thoughts about this? Thanks in advance.
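
To make the second question concrete, here is a rough sketch of the kind of
filtering I have in mind. It is not Nutch code and uses no Nutch API: it
assumes the parser has already produced one text string per page or slide,
and the function name and the 0.8 threshold are made up for illustration.
Lines that show up on most pages are treated as header/footer text and
dropped before indexing.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.8):
    """Drop lines that occur on at least min_fraction of the pages.

    `pages` is a list of strings, one per page or slide (an assumption
    about what the parser hands over). Hypothetical helper, not part of
    Nutch.
    """
    # With fewer than two pages nothing can be called "repeated".
    if len(pages) < 2:
        return list(pages)

    # Count each distinct non-empty line once per page.
    counts = Counter()
    for page in pages:
        counts.update({ln.strip() for ln in page.splitlines() if ln.strip()})

    threshold = min_fraction * len(pages)
    repeated = {line for line, n in counts.items() if n >= threshold}

    # Rebuild every page without the repeated lines.
    cleaned = []
    for page in pages:
        kept = [ln.strip() for ln in page.splitlines()
                if ln.strip() and ln.strip() not in repeated]
        cleaned.append("\n".join(kept))
    return cleaned


slides = [
    "CompanyName Presentation\nQuarterly results",
    "CompanyName Presentation\nRoadmap",
    "CompanyName Presentation\nQuestions",
]
cleaned = strip_repeated_lines(slides)
```

This removes the whole header line, company name included. Keeping the
company name once per document (for example as a separate field) would be an
extra step on top of this.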


--
Gilbert
