PS: Please use [email protected] for non-dev issues.

Regards, Ard

On Thu, Dec 16, 2010 at 2:08 PM, Ard Schrijvers <[email protected]> wrote:
> Hello,
>
> This looks like a PDFBox issue to me. What happens if you try a different PDF?
> If other PDFs work and only a single one fails, it is better to post
> the question to one of the PDFBox mailing lists:
> http://pdfbox.apache.org/mail-lists.html
>
> Regards Ard
>
> On Thu, Dec 16, 2010 at 1:09 PM, Rojas Buitrago, Sergio <[email protected]>
> wrote:
>> Hello.
>>
>> I’m a newbie in Jackrabbit.
>>
>> I’m trying to index the content of different types of documents (Word, PDF,
>> XML, …).
>>
>> I’ve configured the SearchIndex in my workspace.xml in this way:
>>
>> <SearchIndex class="org.apache.jackrabbit.core.query.lucene.SearchIndex">
>>   <param name="path" value="${wsp.home}/index"/>
>>   <param name="supportHighlighting" value="true"/>
>>   <param name="textFilterClasses"
>>          value="org.apache.jackrabbit.extractor.MsWordTextExtractor,
>>                 org.apache.jackrabbit.extractor.MsExcelTextExtractor,
>>                 org.apache.jackrabbit.extractor.MsPowerPointTextExtractor,
>>                 org.apache.jackrabbit.extractor.PdfTextExtractor,
>>                 org.apache.jackrabbit.extractor.OpenOfficeTextExtractor,
>>                 org.apache.jackrabbit.extractor.RTFTextExtractor,
>>                 org.apache.jackrabbit.extractor.HTMLTextExtractor,
>>                 org.apache.jackrabbit.extractor.XMLTextExtractor"/>
>> </SearchIndex>
>>
>> When I create a document in the repository, I add the content in this way:
>>
>> contenido = nodo.addNode("jcr:content", "nt:resource");
>> contenido.setProperty("jcr:data", J_OperacionesSesion
>>         .getValueFactory().createBinary(is));
>>
>> MimetypesFileTypeMap mimetypes = new MimetypesFileTypeMap();
>> String mime = mimetypes.getContentType(nodo.getName());
>> contenido.setProperty("jcr:mimeType", "application/pdf");
>>
>> After creating the document, this warning is thrown:
>>
>> 16.12.2010 13:03:32 *WARN * LazyTextExtractorField: Failed to extract text
>> from a binary property (LazyTextExtractorField.java, line 180)
>> org.apache.tika.exception.TikaException: Unable to extract PDF content
>>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:61)
>>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:69)
>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>>     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>>     at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
>>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
>>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
>>     at java.util.concurrent.FutureTask.run(FutureTask.java:123)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:65)
>>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:168)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
>>     at java.lang.Thread.run(Thread.java:595)
>> Caused by: org.apache.pdfbox.exceptions.WrappedIOException:
>> OperatorProcessor class org.pdfbox.util.operator.ShowTextGlyph could not be
>> instantiated
>>     at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:152)
>>     at org.apache.pdfbox.util.PDFTextStripper.<init>(PDFTextStripper.java:129)
>>     at org.apache.tika.parser.pdf.PDF2XHTML.<init>(PDF2XHTML.java:69)
>>     at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:56)
>>     ... 13 more
>> Caused by: java.lang.ClassCastException:
>> org.pdfbox.util.operator.ShowTextGlyph
>>     at org.apache.pdfbox.util.PDFStreamEngine.<init>(PDFStreamEngine.java:146)
>>     ... 16 more
>>
>> Later, when I search for the document, filtering by content, in this way:
>>
>> String consulta = "SELECT * FROM [arch:documento] AS documento WHERE
>> CONTAINS ( documento.*, 'ubicacion')"; (arch:document extends nt:file)
>>
>> no documents are found.
>>
>> Can you help me, please?
>>
>> Thanks and regards.
>>
>> ________________________________
>> This email and any file attached to it (when applicable) contain(s)
>> confidential information that is exclusively addressed to its recipient(s).
>> If you are not the indicated recipient, you are informed that reading,
>> using, disseminating and/or copying it without authorisation is forbidden in
>> accordance with the legislation in effect. If you have received this email
>> by mistake, please immediately notify the sender of the situation by
>> resending it to their email address.
>> Avoid printing this message if it is not absolutely necessary.
>
> --
> Hippo
> Europe • Amsterdam Oosteinde 11 • 1017 WT Amsterdam • +31 (0)20 522 4466
> USA • San Francisco 755 Baywood Drive, Second Floor • Petaluma, CA. 94954 • +1 877 414 4776 (toll free)
> Canada • Montréal 5369 Boulevard St-Laurent #430 • Montréal QC H2T 1S5 • +1 (514) 316 8966
> www.onehippo.com • www.onehippo.org • [email protected]
> ________________________________________________________________
> This e-mail may be privileged and/or confidential, and the sender does
> not waive any related rights and obligations. Any distribution, use or
> copying of this e-mail or the information it contains by other than an
> intended recipient is unauthorized. If you received this e-mail in
> error, please advise me (by return e-mail or otherwise) immediately.
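
A note on the quoted stack trace: the ClassCastException names org.pdfbox.util.operator.ShowTextGlyph (the older, pre-Apache package name), while the class doing the instantiation is org.apache.pdfbox.util.PDFStreamEngine. One possible explanation, which is an assumption and not something confirmed in this thread, is that an old PDFBox jar sits on the classpath next to the Apache PDFBox jar used by Tika. The sketch below only lists where a few resources are loaded from, so that duplicates become visible. ClasspathDuplicates and locate are hypothetical names, and the properties file path is a guess at where the operator mapping lives; the two class names are taken from the trace.

```java
import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

/** Hypothetical diagnostic helper: shows every classpath location that supplies a resource. */
public class ClasspathDuplicates {

    /** Returns all URLs from which the given resource name can be loaded. */
    static List<URL> locate(String resource) throws IOException {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        if (cl == null) {
            cl = ClassLoader.getSystemClassLoader();
        }
        List<URL> found = new ArrayList<URL>();
        Enumeration<URL> urls = cl.getResources(resource);
        while (urls.hasMoreElements()) {
            found.add(urls.nextElement());
        }
        return found;
    }

    public static void main(String[] args) throws IOException {
        String[] names = {
            // Class named in the ClassCastException (old package name).
            "org/pdfbox/util/operator/ShowTextGlyph.class",
            // Class that tried to instantiate it (Apache package name).
            "org/apache/pdfbox/util/PDFStreamEngine.class",
            // ASSUMPTION: location of the operator mapping file; adjust if your jars differ.
            "Resources/PDFTextStripper.properties"
        };
        for (String name : names) {
            List<URL> hits = locate(name);
            System.out.println(name + " -> " + hits.size() + " location(s)");
            for (URL url : hits) {
                System.out.println("    " + url);
            }
        }
    }
}
```

Run it with the same classpath as the repository (for example from inside the same webapp). If both class files are found, or the properties file shows up in more than one jar, removing the older PDFBox jar would be the first thing to try. If only the Apache classes are found, the cause lies elsewhere and the PDFBox lists mentioned above are the better place to ask.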
