Thanks again Stefan, good idea and I may create a jira issue as other people seem to have run into the same or similar issue.
-- John

On Fri, 2010-07-23 at 03:48 -0400, Stefan Guggisberg wrote:
> On Thu, Jul 22, 2010 at 10:48 PM, John Langley
> <[email protected]> wrote:
> > Thanks that definitely helped!
> >
> > But now I get the following errors / warnings:
> > [#|2010-07-22T20:27:48.722+0000|WARNING|sun-appserver2.1|
> > org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField|
> > _ThreadID=16;_ThreadName=jackrabbit-pool-4;_RequestID=7c74399a-6822-4f8b-b6e1-fcc54e5c37f8;|Failed
> > to extract text from a binary property
> > org.apache.tika.exception.TikaException: TIKA-237: Illegal SAXException
> > from org.apache.tika.parser.xml.dcxmlpar...@85deafc
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:130)
> >     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
> >     at org.apache.jackrabbit.core.query.lucene.JackrabbitParser.parse(JackrabbitParser.java:189)
> >     at org.apache.jackrabbit.core.query.lucene.LazyTextExtractorField$ParsingTask.run(LazyTextExtractorField.java:174)
> >     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:441)
> >     at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> >     at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:98)
> >     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:207)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
> >     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
> >     at java.lang.Thread.run(Thread.java:619)
> > Caused by: org.xml.sax.SAXParseException: The version is required in the XML declaration.
> >     at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
> >     at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
> >     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> >     at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> >     at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
> >     at org.apache.xerces.impl.XMLScanner.scanXMLDeclOrTextDecl(Unknown Source)
> >     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanXMLDeclOrTextDecl(Unknown Source)
> >     at org.apache.xerces.impl.XMLDocumentScannerImpl$XMLDeclDispatcher.dispatch(Unknown Source)
> >     at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
> >     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> >     at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
> >     at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
> >     at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
> >     at org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser.parse(Unknown Source)
> >     at org.apache.xerces.jaxp.SAXParserImpl.parse(Unknown Source)
> >     at javax.xml.parsers.SAXParser.parse(SAXParser.java:198)
> >     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
> >     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
> >     ... 11 more
> > |#]
> >
> > Note: this only happens when I put a "file" in via webdav and the file
> > has an .xml extension but is empty (which is a temporary state in our
> > application)
> >
> > Is there anything I can or should do (other than tweaking the logging
> > properties) to turn off this warning?
>
> the warning is IMO legitimate (trying to index a zero-length file).
> however, the length of the file could probably be checked
> by o.a.jackrabbit.core.query.lucene.LazyTextExtractorField
> before handing it over to TIKA and a less verbose warning
> could be logged if the file is empty. feel free to create a
> jira issue if it really bugs you.
>
> cheers
> stefan
>
> >
> > Thanks in advance, the first suggestion was great!
> >
> > -- Langley
> >
> >
> > On Thu, 2010-07-22 at 10:36 -0400, Stefan Guggisberg wrote:
> >
> >> this might help:
> >> http://markmail.org/message/hctkq6looial7xzr
> >>
> >> cheers
> >> stefan
> >>
> >> On Thu, Jul 22, 2010 at 4:08 PM, John Langley
> >> <[email protected]> wrote:
> >> > We recently upgraded from jackrabbit 2.0 to jackrabbit 2.1 and
> >> > discovered much to our chagrin that storing xml content in the
> >> > repository has been significantly changed. In fact, from our point of
> >> > view it has been broken!
> >> >
> >> > Previously, we had been storing xml content via a webdav client into the
> >> > repository and everything was fine. Now when we try to do this, the
> >> > result is that the content length of these xml files (regardless of
> >> > whether the "file" has a .xml extension or not) is 0 length!
> >> >
> >> > Certainly there MUST be a configuration that can help us "turn off" any
> >> > special processing of xml content that we store in the repository. Could
> >> > someone please point this out?
> >> >
> >> > Thanks in advance!
> >> >
> >> > -- Langley
> >> >
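
For reference, the empty-file check Stefan suggests above could look roughly like the sketch below. It is not Jackrabbit's or Tika's actual code: the class EmptyBinaryGuard, the Extractor interface, and extractIfNotEmpty are hypothetical names made up for illustration, and the real extractor (for example a Tika parser call) is assumed to sit behind the Extractor callback. It uses only the JDK (Java 8 or later) and peeks at the first byte, so it works even when the length of the binary is not known up front.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class EmptyBinaryGuard {

    /** Hypothetical stand-in for the real text extractor (e.g. a Tika parser call). */
    public interface Extractor {
        String extract(InputStream in) throws IOException;
    }

    /**
     * Returns an empty string for a zero-length stream and logs one short
     * line instead of a full stack trace; otherwise delegates to the extractor.
     */
    public static String extractIfNotEmpty(InputStream in, Extractor extractor)
            throws IOException {
        // A pushback buffer of one byte lets us peek without losing data.
        PushbackInputStream peekable = new PushbackInputStream(in, 1);
        int first = peekable.read();
        if (first == -1) {
            // Nothing to index: skip the parser entirely.
            System.err.println("Skipping text extraction: binary property is empty");
            return "";
        }
        // Put the byte back so the extractor sees the complete content.
        peekable.unread(first);
        return extractor.extract(peekable);
    }
}
```

With a guard like this in front of the parser, an empty .xml file uploaded via webdav would produce a single log line rather than the SAXParseException trace quoted above, while non-empty content is passed through unchanged.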
