Hi Emyr,

You can try the XPath-based approach and see whether that works. Also, check
whether dynamic fields can help you with the metadata fields.
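
To make that concrete, here is a minimal sketch of how such a request URL could be put together. It assumes the "xpath" request parameter described on the ExtractingRequestHandler wiki page, which, as I understand it, restricts the extracted XHTML content rather than the metadata. The helper name build_extract_url and the XPath expression are illustrative assumptions; the remaining parameters come from your original URL. The code only builds the URL and sends nothing. The "uprefix=ignored_" setting presumably only helps if the schema has a matching "ignored_*" dynamic field.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, doc_id, title, xpath=None):
    """Build an /update/extract URL (hypothetical helper; sends no request)."""
    params = [
        ("literal.id", doc_id),        # explicit document id
        ("literal.title", title),      # title set explicitly by the client
        ("fmap.content", "content"),   # map extracted body text to 'content'
        ("uprefix", "ignored_"),       # prefix applied to unknown fields
        ("commit", "true"),
    ]
    if xpath is not None:
        # Assumption: 'xpath' limits which extracted XHTML content is kept.
        params.append(("xpath", xpath))
    return base_url + "/update/extract?" + urlencode(params)


url = build_extract_url(
    "http://localhost:8080/solr",
    "3922",
    "Reactor cycle 141",
    # Illustrative expression only; adjust to the content you want to keep.
    xpath="/xhtml:html/xhtml:body/xhtml:div/descendant:node()",
)
print(url)
```

The file itself would still be posted as the request body, exactly as you do now.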



On Thu, May 5, 2011 at 7:28 PM, Emyr James <emyr.ja...@sussex.ac.uk> wrote:

> Thanks for the suggestion, but surely there must be a better way to do it
> than that?
> I don't want to post the whole file, have it extracted on the server,
> have the extracted text sent back to the client, and then send it all back
> up to the server again as plain text.
> On 05/05/11 14:55, Jay Luker wrote:
>> Hi Emyr,
>> You could try using the "extractOnly=true" parameter [1]. Of course,
>> you'll need to repost the extracted text manually.
>> --jay
>> [1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
>> On Thu, May 5, 2011 at 9:36 AM, Emyr James <emyr.ja...@sussex.ac.uk> wrote:
>>> Hi All,
>>> I have Solr and Tika installed and am happily extracting and indexing
>>> various files.
>>> Unfortunately, on some Word documents it blows up because it tries to
>>> auto-generate a 'title' field, but my title field in the schema is
>>> single-valued.
>>> Here is my config for the extract handler...
>>> <requestHandler name="/update/extract"
>>> class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
>>> <lst name="defaults">
>>> <str name="uprefix">ignored_</str>
>>> </lst>
>>> </requestHandler>
>>> Is there a config option to make it only extract text, or ideally to
>>> allow me to specify which metadata fields to accept?
>>> E.g. I'd like to use any author metadata it finds but not use any title
>>> metadata it finds, as I want title to be single-valued and set explicitly
>>> using a literal.title in the post request.
>>> I did look around for some docs, but all I can find are very basic
>>> examples. There's no comprehensive configuration documentation out there
>>> as far as I can tell.
>>> ALSO...
>>> I get some other bad responses coming back such as...
>>> Apache Tomcat/6.0.28 - Error report
>>> HTTP Status 500 - org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>>> java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>>>    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
>>>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>>>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>>>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>>>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
>>>    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>>>    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>>>    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>>    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>>>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>>    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>>    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>>>    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>>>    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>>    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>>>    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>>>    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>>>    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>>>    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>>>    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>>>    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
>>>    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
>>>    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
>>>    at java.lang.Thread.run(Thread.java:636)
>>> type: Status report
>>> message: org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>>> For the above my url was...
>>> http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten
>>> I guess there's something special I need in order to process PowerPoint
>>> files? Maybe I need to get the latest Apache POI? Any suggestions
>>> welcome...
>>> Regards,
>>> Emyr
