I will try again, zipping the files this time. Later I will post the files on the web site.
> Could you also tell us a bit about this code? Is it better than
> existing PDF/Word parsing solutions? Pure Java? Uses POI?

This code uses existing parsing solutions.
The intent is to build a Lucene Document for indexing PDF and Word files, including their content.
It is pure Java.
It uses the TextExtraction library (tm-extractors-0.2.jar), along with POI and PDFBox.

Ernesto
Sorry for my bad English.

>
> Thanks,
> Otis
>
>
> --- Ernesto De Santis <[EMAIL PROTECTED]> wrote:
> > Classes for indexing PDF and Word files in Lucene.
> > Ernesto.
> >
> > ----- Original Message -----
> > From: "Ernesto De Santis" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Wednesday, October 29, 2003 12:04 PM
> > Subject: Re: [opencms-dev] Index pdf files with your content in lucene.
> >
> >
> > Hello all,
> >
> > Thanks very much, Stephan, for your valuable help.
> > Attached you will find the source code of the PDFDocument and
> > WordDocument classes.
> >
> > Ernesto.
> >
> >
> > ----- Original Message -----
> > From: "Hartmann, Waehrisch & Feykes GmbH" <[EMAIL PROTECTED]>
> > To: <[EMAIL PROTECTED]>
> > Sent: Tuesday, October 28, 2003 11:10 AM
> > Subject: Re: [opencms-dev] Index pdf files with your content in lucene.
> >
> >
> > > Hi Ernesto,
> > >
> > > the IndexManager retrieves the list of files in a folder by calling
> > > the getFilesInFolder method of CmsObject. This method returns only
> > > empty files, i.e. files with empty content. To get the content of a
> > > PDF file you have to reread the file:
> > > f = cms.readFile(f.getAbsolutePath());
> > >
> > > Bye,
> > > Stephan
> > >
> > > On Monday, 27 October 2003 19:18 you wrote:
> > > >
> > > > Hello
> > > >
> > > > Thanks for the previous reply.
> > > >
> > > > Now I use:
> > > > - version 1.4 of the Lucene search module (the version attached
> > > >   on this list)
> > > > - the new version of the registry.xml format for the module (as
> > > >   you wrote me)
> > > > - the PDF files are stored with the binary type.
> > > >
> > > > But I have the following problem:
> > > > I can't create an InputStream for the CmsFile content.
> > > > For this I wrote the following code in the Document method of my
> > > > PDFDocument class:
> > > >
> > > > -----------------
> > > >
> > > > // f is the CmsFile parameter of the Document method
> > > > InputStream in = new ByteArrayInputStream(f.getContents());
> > > >
> > > > // PDFExtractor is the library I use. On the file system it works fine.
> > > > PDFExtractor extractor = new PDFExtractor();
> > > >
> > > > bodyText = extractor.extractText(in);
> > > >
> > > > ----------------
> > > >
> > > > Is it correct to use ByteArrayInputStream to create an InputStream
> > > > for a CmsFile?
> > > >
> > > > The error occurs in the third line, in the PDFParser.
> > > > The error message in Tomcat is:
> > > >
> > > > java.io.IOException: Error: Header is corrupt ''
> > > > at PDFParser.parse
> > > > at PDFExtractor.extractText
> > > > at PDFDocument.Document (my class)
> > > > at.....
> > > >
> > > > Bye, and thanks.
> > > > Ernesto.
> > > >
> > > >
> > > > ----- Original Message -----
> > > > From: Hartmann, Waehrisch & Feykes GmbH
> > > > To: [EMAIL PROTECTED]
> > > > Sent: Friday, October 24, 2003 4:45 AM
> > > > Subject: Re: [opencms-dev] Index pdf files with your content in lucene.
> > > >
> > > >
> > > > Hello Ernesto,
> > > >
> > > > I assume you are using the unpatched version 1.3 of the search
> > > > module. As I mentioned yesterday, the plainDocFactory only indexes
> > > > CmsFiles of type "plain", not of type "binary". PDF files are
> > > > stored as binary. I suggest using the version I posted yesterday.
> > > > Then your registry.xml would have to look like this:
> > > > ...
> > > > <docFactories>
> > > > ...
> > > > <docFactory type="plain" enabled="true">
> > > > ...
> > > > </docFactory>
> > > > <docFactory type="binary" enabled="true">
> > > > <fileType name="pdftext">
> > > > <extension>.pdf</extension>
> > > > <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
> > > > </fileType>
> > > > </docFactory>
> > > > ...
> > > > </docFactories>
> > > >
> > > > Important: The type attribute must match one of the OpenCms file
> > > > types (also defined in the registry.xml).
> > > >
> > > > Bye,
> > > > Stephan
> > > >
> > > > ----- Original Message -----
> > > > From: Ernesto De Santis
> > > > To: Lucene Users List
> > > > Cc: [EMAIL PROTECTED]
> > > > Sent: Thursday, October 23, 2003 4:16 PM
> > > > Subject: [opencms-dev] Index pdf files with your content in lucene.
> > > >
> > > >
> > > > Hello
> > > >
> > > > I am new to OpenCms and Lucene technology.
> > > >
> > > > I want to index PDF files, including the content of these files.
> > > >
> > > > I work this way:
> > > >
> > > > I made a PDFDocument class modeled on the JspDocument class.
> > > > It uses the org.textmining.text.extraction.PDFExtractor class;
> > > > this class works fine outside the VFS.
> > > >
> > > > I also wrote an entry in my registry.xml for PDF documents, in the
> > > > plainDocFactory tag:
> > > >
> > > > <fileType name="pdftext">
> > > > <extension>.pdf</extension>
> > > > <!-- This will strip tags before processing -->
> > > > <class>net.grcomputing.opencms.search.lucene.PDFDocument</class>
> > > > </fileType>
> > > >
> > > > My PDFDocument contains the code below.
> > > > I think the problem is how to get the content from the CmsFile.
> > > > Which InputStream should I use? PDFExtractor works with the
> > > > extractText(InputStream) method.
> > > >
> > > > public class PDFDocument implements I_DocumentConstants,
> > > >         I_DocumentFactory {
> > > >
> > > >     public PDFDocument() {
> > > >     }
> > > >
> > > >     public Document Document(CmsObject cmsobject, CmsFile cmsfile)
> > > >             throws CmsException {
> > > >         return Document(cmsobject, cmsfile, null);
> > > >     }
> > > >
> > > >     public Document Document(CmsObject cmsobject, CmsFile cmsfile,
> > > >             HashMap hashmap) throws CmsException {
> > > >         Document document =
> > > >                 (new BodylessDocument()).Document(cmsobject, cmsfile);
> > > >
> > > >         // Add the content of the PDF file.
> > > >         String contenido = new String(cmsfile.getContents());
> > > >         StringBufferInputStream in = new StringBufferInputStream(contenido);
> > > >         // ByteArrayInputStream in = new ByteArrayInputStream(contenido.getBytes());
> > > >
> > > >         /* try{
> > > >         FileInputStream in = new FileInputStream (cmsfile.getPath() + cmsfile.getName());
> > > >         */
> > > >         PDFExtractor extractor = new PDFExtractor();
> > > >         String body = extractor.extractText(in);
> > > >
> > > >         document.add(Field.Text("body", body));
> > > >
> > > >         /* }catch(FileNotFoundException e){
> > > >             e.toString();
> > > >             throw new CmsException();
> > > >         }
> > > >         */
> > > >         return (document);
> > > >     }
> > > > }
> > > >
> > > >
> > > > Thanks
> > > > Ernesto
> > > > P.S.: Sorry for my poor English.
> > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > From: "Hartmann, Waehrisch & Feykes GmbH" > > > > <[EMAIL PROTECTED]> To: <[EMAIL PROTECTED]> > > > > Sent: Wednesday, October 22, 2003 3:50 AM > > > > Subject: Re: [opencms-dev] (no subject) > > > > > > > > > Hi Ben, > > > > > > > > > > i think this won't work since the plainDocFactory will only > > be > > used > > > > > for files of type "plain" but not for files of type > > "binary". > > > > > Recently we have done some additions to the module - by > > order of > > > > > Lenord, Bauer & Co. GmbH - that could meet your needs. It > > introduces > > > > > a more flexible way of defining docFactories that you can > > add new > > > > > factories without having to recompile the whole module. So > > other > > > > > modules (like the news) can bring their own docFactory and > > all you > > > > > have to do is to edit the registry.xml. Here is an example: > > > > > > > > > > <docFactories> > > > > > <docFactory enabled="true" type="plain"> > > > > > <fileType name="plaintext"> > > > > > <extension>.txt</extension> > > > > > > > > > > > > <class>net.grcomputing.opencms.search.lucene.PlainDocument</class> > > > > > </fileType> > > > > > </docFactory> > > > > > <docFactory enabled="true" type="news"> > > > > > > > > > > > > <class>net.grcomputing.opencms.search.lucene.NewsDocument</class> > > > > > </docFactory> > > > > > </docFactories> > > > > > > > > > > To index binary files all you need to add is this: > > > > > > > > > > <docFactory enabled="true" type="binary"> > > > > > > > > > > > > <class>net.grcomputing.opencms.search.lucene.BodylessDocument</class> > > > > > </docFactory> > > > > > > > > > > There should be no need for an extension mapping. 
> > > > > > > > > > For the interested people: > > > > > For ContentDefinitions (like news) i introduced the > > following: > > > > > <contentDefinitions> > > > > > <contentDefinition type="news"> <!-- must > > match > > > > > docFactory type --> > > > > > > > > > > > > <class>com.opencms.modules.homepage.news.NewsContentDefinition</class > > > > >> > > > > > > > > > > > > <initClass>net.grcomputing.opencms.search.lucene.NewsInitialization</ > > > > >initCla ss> > > > > > <listMethod name="getNewsList"> > > > > > <param > > type="java.lang.Integer">1</param> > > > > > <param > > type="java.lang.String">-1</param> > > > > > </listMethod> > > > > > <page uri="/news.html?__element=entry"> > > > > > <param method="getIntId" > > name="newsid"/> > > > > > </page> > > > > > </contentDefinition> > > > > > > > > > > In short: > > > > > initClass is optional: For the news the news classes have > > to be > > > > > loaded to initialize the db pool. > > > > > listMethod: a method of the content definition class that > > returns > > a > > > > > List of elements > > > > > page: the page that can display an entry. Here a jsp that > > has a > > > > > template element "entry". It also needs the id of the news > > item. > > > > > getIntId is a method of the content definition class and > > newsid is > > > > > the url parameter the page needs. A link like > > > > > news.html?__element=entry&newsid=xy > > > > > will be generated. > > > > > > > > > > Best regards, > > > > > Stephan > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "Ben Rometsch" <[EMAIL PROTECTED]> > > > > > To: <[EMAIL PROTECTED]> > > > > > Sent: Wednesday, October 22, 2003 6:15 AM > > > > > Subject: [opencms-dev] (no subject) > > > > > > > > > > > Hi Matt, > > > > > > > > > > > > I am not having any joy! 
I've updated my registry.xml > > file, with > > > > > > the appropriate section reading: > > > > > > > > > > > > <luceneSearch> > > > > > > <mergeFactor>100000</mergeFactor> > > > > > > <permCheck>true</permCheck> > > > > > > <indexDir>c:\search</indexDir> > > > > > > > > > > > > > > <analyzer>org.apache.lucene.analysis.standard.StandardAnalyzer</ana > > > > > >lyzer> <subsearch>true</subsearch> > > > > > > <project>online</project> > > > > > > <docFactories> > > > > > > <pageDocFactory enabled="true"> > > > > > > > > > > > > > > <class>net.grcomputing.opencms.search.lucene.PageDocument</class> > > > > > > </pageDocFactory> > > > > > > <plainDocFactory enabled="true"> > > > > > > <fileType name="plaintext"> > > > > > > <extension>.txt</extension> > > > > > > > > > > > > > > <class>net.grcomputing.opencms.search.lucene.PlainDocument</class> > > > > > > </fileType> > > > > > > <fileType name="taggedtext"> > > > > > > <extension>.html</extension> > > > > > > <extension>.htm</extension> > > > > > > <extension>.xml</extension> > > > > > > <!-- This will strip tags before processing > > > > > > --> > > > > > > > > > > > > > > <class>net.grcomputing.opencms.search.lucene.TaggedPlainDocument</c > > > > > >lass> </fileType> > > > > > > > > > > > > <!-- Index binary documents --> > > > > > > <fileType name="plaindocument"> > > > > > > <extension>.doc</extension> > > > > > > <extension>.xls</extension> > > > > > > <extension>.pdf</extension> > > > > > > > > > > > > > > <class>net.grcomputing.opencms.search.lucene.BodylessDocument</clas > > > > > >s> </fileType> > > > > > > > > > > > > </plainDocFactory> > > > > > > <jspDocFactory enabled="true"> > > > > > > > > > > > > > > <class>net.grcomputing.opencms.search.lucene.JspDocument</class> > > > > > > </jspDocFactory> > > > > > > <xmlTemplateDocFactory enabled="false"/> > > > > > > </docFactories> > > > > > > <directories> > > > > > > <directory location="/release/"> > > > > > > <section>Test</section> > > > > > > 
<subsearch>true</subsearch> > > > > > > </directory> > > > > > > <directory location="/RGLIntranet/"> > > > > > > <section>Test2</section> > > > > > > <subsearch>true</subsearch> > > > > > > </directory> > > > > > > </directories> > > > > > > </luceneSearch> > > > > > > > > > > > > Notice the section beginning after the remark "Index > > binary > > > > > > documents". > > > > > > > > > > > > But I cannot get any hits when searching for document > > names that > > > > > > are in > > > > > > > > > > the > > > > > > > > > > > VFS. The other (HTML) searches are working ok. Is the > > "name" > > > > > > property of > > > > > > > > > > the > > > > > > > > > > > fileType tag important? I wasn't sure what to add > > here...I'm not > > > > > > quite > > > > > > > > > > sure > > > > > > > > > > > how to move forward. Maybe it would be an idea to add > > some > > > > > > debugging trace to the BodylessDocument class to see what > > is > > going > > > > > > on inside it? I want to make sure my XML is correct first > > tho! > > > > > > > > > > > > Thanks for the help, > > > > > > Ben > > > > > > > > > > > > On Thu, 2003-10-16 at 22:46, Ben Rometsch wrote: > > > > > > > Hi Matt, > > > > > > > > > > > > > > Thanks for the reply. If I just want to get the > > document title > > to > > > > > > > be included in the Lucene index, looking at the code in > > the > > > > > > > net.grcomputing.opencms.search.BodylessDocument class > > it > > appears > > > > > > > to > > > > > > > > > > ignore > > > > > > > > > > > > what the CMSObject is, and attempt to index it > > regardless. Is > > > > > > > this > > > > > > > > > > > > correct? > > > > > > > > > > > > > > > > > > Correct. It will already index the title, but it will not > > attempt > > > > > > to index the body. > > > > > > > > > > > > > If this is the case, is it simply a matter of > > instructing > > Lucene > > > > > > > to > > > > > > > > > > index > > > > > > > > > > > > obects other than HTML files in the VFS (i.e. 
> > Documents) ? Or > > > > > > > would I > > > > > > > > > > > > have > > > > > > > > > > > > > to create another class, something like > > > > > > > net.grcomputing.opencms.search.FileDocument and add a > > new hook > > > > > > > into that class via the registry.xml fragment? Or does > > the > > > > > > > BodyLess document > > > > > > > > > > > > provide > > > > > > > > > > > > > this functionality, and it's just a matter of adding a > > new XML > > > > > > > fragment > > > > > > > > > > to > > > > > > > > > > > > the registry.xml are? > > > > > > > > > > > > Again, you are right -- simply adding the appropriate > > configuration > > > > > > to the registry.xml file will suffice. I believe that you > > will > > just > > > > > > need to extend the plainDocument tag set to include > > extensions > > and > > > > > > processors... I _think_ that binary files get handled by > > the > > plain > > > > > > handler. > > > > > > > > > > > > Matt > > > > > > > > > > > > _______________________________________________ > > > > > > This mail is send to you from the opencms-dev mailing > > list > > > > > > To change your list options, or to unsubscribe from the > > list, > > > > > > please visit > > http://mail.opencms.org/mailman/listinfo/opencms-dev > > > > > > > > > > Stephan Hartmann > > > > > Unternehmensberatung Währisch & Feykes GmbH > > > > > Gustav-Adolf-Str. 5 > > > > > 47057 Duisburg > > > > > > > > > > Tel.: 0203-373070 > > > > > Fax: 0203-376766 > > > > > E-Mail: [EMAIL PROTECTED] > > > > > Internet: www.wfnetz.de > > > > > > > > > > Über das Internet versandte E-Mails können unter fremden > > Namen > > > > > erstellt oder manipuliert werden. Aus diesem Grund > > enthalten > > unsere > > > > > mit E-Mail verschickten Nachrichten grundsätzlich keine > > > > > rechtsverbindlichen Willenserklärungen. 
> > >
> > > --
> > > Stephan Hartmann
> > >
> > > Währisch & Feykes GmbH
> > > Gustav-Adolf-Str. 5
> > > 47057 Duisburg
> > > Tel. 0203 / 373 070
> > > Fax 0203 / 376 766
> > > [EMAIL PROTECTED]
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > For additional commands, e-mail: [EMAIL PROTECTED]
>
/* ==================================================================== * The TextMining.Org Software License, Version 1.1 * * Copyright (c) 2002 Ryan Ackley All rights * reserved. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in * the documentation and/or other materials provided with the * distribution. * * 3. The end-user documentation included with the redistribution, * if any, must include the following acknowledgment: * "This product includes software developed by * TextMining.Org (http://www.textmining.org/)." * Alternately, this acknowledgment may appear in the software itself, * if and wherever such third-party acknowledgments normally appear. * * 4. The name "TextMining.Org" must not be used to endorse or promote products * derived from this software without prior written permission. For * written permission, please contact [EMAIL PROTECTED] * * 5. Products derived from this software may not be called "TextMining.org", * "TextMining.org extractors", nor may "TextMining.org" appear in their name, without * prior written permission of the Apache Software Foundation. * * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE * DISCLAIMED. 
IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * ==================================================================== * * This software consists of voluntary contributions made by many * individuals on behalf of the Apache Software Foundation. For more * information on the Apache Software Foundation, please see * <http://www.apache.org/>. */
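
A note on the "Header is corrupt ''" error discussed earlier in this thread: the empty quotes suggest the parser received no bytes at all, which fits Stephan's remark that getFilesInFolder returns files with empty content. Below is a minimal sketch, using only the JDK, of a guard that could run before the extractor is called. The class and method names (PdfContentCheck, looksLikePdf, openPdfStream) are hypothetical and are not part of OpenCms, PDFBox or tm-extractors. It wraps the byte array directly in a ByteArrayInputStream, which avoids a round trip through String that may alter binary data, and it assumes a well-formed PDF starts with the ASCII marker "%PDF-".

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper; not part of OpenCms, PDFBox or tm-extractors. */
final class PdfContentCheck {

    // Assumption: a well-formed PDF file starts with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC =
            "%PDF-".getBytes(StandardCharsets.US_ASCII);

    private PdfContentCheck() {
    }

    /** Returns true if the bytes start with the PDF marker. */
    static boolean looksLikePdf(byte[] contents) {
        if (contents == null || contents.length < PDF_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < PDF_MAGIC.length; i++) {
            if (contents[i] != PDF_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /**
     * Wraps the raw bytes in a stream without converting them to a String.
     * Fails early with a descriptive message if the content is empty
     * or does not look like a PDF.
     */
    static InputStream openPdfStream(byte[] contents) throws IOException {
        if (!looksLikePdf(contents)) {
            throw new IOException(
                    "Content is empty or not a PDF; was the file reread with its content?");
        }
        return new ByteArrayInputStream(contents);
    }
}
```

In the Document method this could be used as extractor.extractText(PdfContentCheck.openPdfStream(cmsfile.getContents())), after rereading the file with cms.readFile as Stephan suggests. That wiring is an untested assumption about how the pieces fit together.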