Hi Emyr,

You can try the XPath-based approach and see if that works. Also, check
whether dynamic fields can help with the metadata fields.
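
In case it helps, here is a minimal sketch of the dynamic-field idea for
schema.xml. The "ignored" field type and the "ignored_*" pattern follow the
stock Solr example schema; whether they fit your schema is an assumption, and
I have not tested this against your setup. Since your handler sets
uprefix=ignored_, any Tika metadata field that is not defined in the schema is
renamed to ignored_<name>, and a dynamic field like this accepts it without
indexing or storing it:

```xml
<!-- Assumed to go in the <types> section: a field type that neither
     indexes nor stores its values, so matching fields are dropped. -->
<fieldType name="ignored" class="solr.StrField"
           indexed="false" stored="false" multiValued="true"/>

<!-- Assumed to go in the <fields> section: catches every field the
     extract handler prefixed with "ignored_" via uprefix. -->
<dynamicField name="ignored_*" type="ignored" multiValued="true"/>
```

Note that uprefix only applies to fields the schema does not define, so on its
own it will not keep Tika's title out of your existing title field.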

References:
http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters
http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput

Regards,
Anuj

On Thu, May 5, 2011 at 7:28 PM, Emyr James <emyr.ja...@sussex.ac.uk> wrote:

> Thanks for the suggestion, but surely there must be a better way to do it
> than that?
> I don't want to post the whole file up, get it extracted on the server,
> send the extracted text back to the client then send it all back up to the
> server again as plain text.
>
>
> On 05/05/11 14:55, Jay Luker wrote:
>
>> Hi Emyr,
>>
>> You could try using the "extractOnly=true" parameter [1]. Of course,
>> you'll need to repost the extracted text manually.
>>
>> --jay
>>
>> [1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
>>
>>
>> On Thu, May 5, 2011 at 9:36 AM, Emyr James <emyr.ja...@sussex.ac.uk> wrote:
>>
>>> Hi All,
>>>
>>> I have Solr and Tika installed and am happily extracting and indexing
>>> various files.
>>> Unfortunately, on some Word documents it blows up because it tries to
>>> auto-generate a 'title' field, but my title field in the schema is
>>> single-valued.
>>>
>>> Here is my config for the extract handler...
>>>
>>> <requestHandler name="/update/extract"
>>> class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
>>> <lst name="defaults">
>>> <str name="uprefix">ignored_</str>
>>> </lst>
>>> </requestHandler>
>>>
>>> Is there a config option to make it only extract text, or ideally to
>>> allow me to specify which metadata fields to accept?
>>>
>>> E.g. I'd like to use any author metadata it finds but not any title
>>> metadata, as I want title to be single-valued and set explicitly using
>>> a literal.title in the post request.
>>>
>>> I did look around for some docs, but all I can find are very basic
>>> examples. There's no comprehensive configuration documentation out
>>> there as far as I can tell.
>>>
>>>
>>> ALSO...
>>>
>>> I get some other bad responses coming back such as...
>>>
>>> <html><head><title>Apache Tomcat/6.0.28 - Error report</title><style><!--
>>> H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
>>> H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
>>> H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
>>> BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}
>>> B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
>>> P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}
>>> A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style>
>>> </head><body><h1>HTTP Status 500 - org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>>>
>>> java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>>>    at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
>>>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>>>    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
>>>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
>>>    at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
>>>    at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
>>>    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
>>>    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
>>>    at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
>>>    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
>>>    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
>>>    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
>>>    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
>>>    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
>>>    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
>>>    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
>>>    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
>>>    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
>>>    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
>>>    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
>>>    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
>>>    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
>>>    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
>>>    at java.lang.Thread.run(Thread.java:636)
>>> </h1><HR size="1" noshade="noshade"><p><b>type</b>  Status
>>> report</p><p><b>message</b>
>>>
>>> <u>org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
>>>
>>> For the above my url was...
>>>
>>>
>>> http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten
>>>
>>> I guess there's something special I need in order to process PowerPoint
>>> files? Maybe I need to get the latest Apache POI? Any suggestions
>>> welcome...
>>>
>>>
>>> Regards,
>>>
>>> Emyr
>>>
>>>
>
