On Thu, Jul 22, 2010 at 10:48 PM, John Langley <[email protected]> wrote:
> Thanks that definitely helped!
>
> But now I get the following errors / warnings:
> [#|2010-07-22T20:27:48.722+0000|WARNING|sun-appserver2.1|org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField|_ThreadID=16;_ThreadName=jackrabbit-pool-4;_RequestID=7c74399a-6822-4f8b-b6e1-fcc54e5c37f8;|Failed to extract text from a binary property
> org.apache.tika.exception.TikaException: TIKA-237: Illegal SAXException from org.apache.tika.parser.xml.dcxmlpar...@85deafc
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:130)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
>     at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
>     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
>     at java.lang.Thread.run(Thread.java:619)
> Caused by: org.xml.sax.SAXParseException: The version is required in the XML declaration.
>     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
>     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>     at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>     at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown Source)
>     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
>     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
>     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
>     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>     ... 11 more
> |#]
>
> Note: this only happens when I put a "file" in via webdav and the file
> has an .xml extension but is empty (which is a temporary state in our
> application)
>
> Is there anything I can or should do (other than tweaking the logging
> properties) to turn off this warning?

the warning is IMO legitimate (trying to index a zero-length file).

however, the length of the file could probably be checked by
o.a.jackrabbit.core.query.lucene.LazyTextExtractorField
before handing it over to TIKA and a less verbose warning
could be logged if the file is empty.

feel free to create a jira issue if it really bugs you.

cheers
stefan

>
> Thanks in advance, the first suggestion was great!
>
> -- Langley
>
>
> On Thu, 2010-07-22 at 10:36 -0400, Stefan Guggisberg wrote:
>
>> this might help:
>> http://markmail.org/message/hctkq6looial7xzr
>>
>> cheers
>> stefan
>>
>> On Thu, Jul 22, 2010 at 4:08 PM, John Langley
>> <[email protected]> wrote:
>> > We recently upgraded from jackrabbit 2.0 to jackrabbit 2.1 and
>> > discovered much to our chagrin that storing xml content in the
>> > repository has been significantly changed. In fact, from our point of
>> > view it has been broken!
>> >
>> > Previously, we had been storing xml content via a webdav client into the
>> > repository and everything was fine. Now when we try to do this, the
>> > result is that the content length of these xml files (regardless of
>> > whether the "file" has a .xml extension or not) is 0 length!
>> >
>> > Certainly there MUST be a configuration that can help us "turn off" any
>> > special processing of xml content that we store in the repository. Could
>> > someone please point this out?
>> >
>> > Thanks in advance!
>> >
>> > -- Langley
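
For illustration only, the "check the length before handing it over" idea from the reply could look roughly like the sketch below. It is not Jackrabbit or Tika code: EmptyBinaryGuard, extractText and readUtf8 are hypothetical names, and the extractor argument is just a stand-in for whatever parser would normally be called. The sketch peeks at the first byte of the stream, returns an empty string for zero-length content without calling the parser, and otherwise pushes the byte back so the parser sees the full stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical helper, not part of Jackrabbit or Tika. */
final class EmptyBinaryGuard {

    private EmptyBinaryGuard() {
    }

    /**
     * Returns "" for a zero-length stream without invoking the extractor;
     * otherwise hands the complete, unconsumed stream to the extractor.
     * The extractor is an assumed stand-in for the real text parser.
     */
    static String extractText(InputStream binary,
                              Function<InputStream, String> extractor) {
        try {
            PushbackInputStream in = new PushbackInputStream(binary, 1);
            int first = in.read();
            if (first == -1) {
                // Nothing to index: skip the parser (a short log line could go here).
                return "";
            }
            in.unread(first); // put the peeked byte back for the parser
            return extractor.apply(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Demo extractor: reads the whole stream as UTF-8 text. */
    static String readUtf8(InputStream in) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        InputStream xml = new ByteArrayInputStream(
                "<doc/>".getBytes(StandardCharsets.UTF_8));
        System.out.println("[" + extractText(empty, EmptyBinaryGuard::readUtf8) + "]");
        System.out.println("[" + extractText(xml, EmptyBinaryGuard::readUtf8) + "]");
    }
}
```

Peeking at one byte avoids relying on a length property, which may not be available for every stream. Whether such a check belongs in LazyTextExtractorField itself is a question for the jira issue mentioned above.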
