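Hi Emyr,

You could try the "extractOnly=true" parameter [1]. With it set, the handler returns the extracted text and metadata in the response instead of indexing the document. Of course, you'll then need to repost the extracted text to Solr yourself, along with whichever fields you actually want (for example a single explicit title).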
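
A rough sketch of that two-step flow in Python, untested against your setup. The helper names (extract_only_url, build_add_xml) are made up for illustration; the base URL and the field names (id, title, content, author) are taken from the request you quoted, and I'm assuming the extracted text has already been pulled out of the extractOnly response.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Base URL copied from the request quoted below; adjust for your install.
EXTRACT_URL = "http://localhost:8080/solr/update/extract"


def extract_only_url(content_type=None):
    """Build a URL asking the handler to return extracted content, not index it."""
    params = {"extractOnly": "true"}
    if content_type is not None:
        params["stream.contentType"] = content_type
    return EXTRACT_URL + "?" + urlencode(params)


def build_add_xml(doc_id, title, content, author=None):
    """Build a Solr <add> message that carries exactly one title value."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names are assumptions based on the literal.* parameters in the thread.
    fields = [("id", doc_id), ("title", title), ("content", content)]
    if author:
        fields.append(("author", author))
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")
```

You would POST the file to extract_only_url(...), take the text out of the response, then POST the output of build_add_xml(...) to your normal update handler (assumed here to be /solr/update, with a text/xml content type). Because the title is set only once and only by you, the single-valued field should no longer be a problem.
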
--jay

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only

On Thu, May 5, 2011 at 9:36 AM, Emyr James <emyr.ja...@sussex.ac.uk> wrote:

> Hi All,
>
> I have solr and tika installed and am happily extracting and indexing
> various files.
> Unfortunately on some word documents it blows up since it tries to
> auto-generate a 'title' field but my title field in the schema is single
> valued.
>
> Here is my config for the extract handler...
>
> <requestHandler name="/update/extract"
>     class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
>   <lst name="defaults">
>     <str name="uprefix">ignored_</str>
>   </lst>
> </requestHandler>
>
> Is there a config option to make it only extract text, or ideally to allow
> me to specify which metadata fields to accept ?
>
> E.g. I'd like to use any author metadata it finds but to not use any title
> metadata it finds as I want title to be single valued and set explicitly
> using a literal.title in the post request.
>
> I did look around for some docs but all i can find are very basic examples.
> there's no comprehensive configuration documentation out there as far as I
> can tell.
>
>
> ALSO...
>
> I get some other bad responses coming back such as...
>
> <html><head><title>Apache Tomcat/6.0.28 - Error report</title><style><!--H1
> {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
> H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
> H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
> BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}
> B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
> P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}A
> {color : black;}A.name {color : black;}HR {color : #525D76;}--></style>
> </head><body><h1>HTTP Status 500 -
> org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>
> java.lang.NoSuchMethodError:
> org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
> at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
> at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
> at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
> at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
> at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
> at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
> at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
> at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
> at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
> at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
> at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
> at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
> at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
> at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
> at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
> at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
> at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
> at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
> at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
> at java.lang.Thread.run(Thread.java:636)
> </h1><HR size="1" noshade="noshade"><p><b>type</b> Status
> report</p><p><b>message</b>
> <u>org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>
> For the above my url was...
>
> http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten
>
> I guess there's something special I need to be able to process power point
> files ? Maybe I need to get the latest apache POI ? Any suggestions
> welcome...
>
>
> Regards,
>
> Emyr
>