Re: Text Only Extraction Using Solr and Tika

2011-05-05 Thread Jay Luker

Hi Emyr,

You could try using the extractOnly=true parameter [1]. Of course,
you'll need to repost the extracted text manually.

--jay

[1] http://wiki.apache.org/solr/ExtractingRequestHandler#Extract_Only
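For readers following along, the two-step round trip described above could be sketched roughly as follows. This is only an illustration: the base URL and the field names (`id`, `title`, `content`) are assumptions borrowed from the request URL quoted later in this thread, the helper names are hypothetical, and the actual HTTP posting is left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Assumption: base URL taken from the request URL quoted in this thread.
SOLR_BASE = "http://localhost:8080/solr"


def extract_only_url(base=SOLR_BASE):
    """Step 1: URL asking the extract handler to return Tika's output without indexing it."""
    return base + "/update/extract?" + urlencode({"extractOnly": "true"})


def build_add_xml(doc_id, title, text):
    """Step 2: Solr <add> message that reposts the extracted text as an ordinary document."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names are assumptions; they must match the target schema.
    for name, value in (("id", doc_id), ("title", title), ("content", text)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    print(extract_only_url())
    print(build_add_xml("3922", "Reactor cycle 141", "text returned by step 1"))
```

The cost of this approach is that the document content crosses the wire three times: the file goes up, the extracted text comes back, and the text goes up again.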


On Thu, May 5, 2011 at 9:36 AM, Emyr James emyr.ja...@sussex.ac.uk wrote:
 Hi All,

 I have Solr and Tika installed and am happily extracting and indexing
 various files.
 Unfortunately, on some Word documents it blows up because it tries to
 auto-generate a 'title' field, but my title field in the schema is
 single-valued.

 Here is my config for the extract handler...

 <requestHandler name="/update/extract"
                 class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
   <lst name="defaults">
     <str name="uprefix">ignored_</str>
   </lst>
 </requestHandler>

 Is there a config option to make it extract only text, or ideally to allow
 me to specify which metadata fields to accept?

 E.g. I'd like to use any author metadata it finds but to not use any title
 metadata it finds as I want title to be single valued and set explicitly
 using a literal.title in the post request.

 I did look around for some docs, but all I can find are very basic examples.
 There's no comprehensive configuration documentation out there as far as I
 can tell.


 ALSO...

 I get some other bad responses coming back such as...

 <html><head><title>Apache Tomcat/6.0.28 - Error report</title><style><!--
 H1 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:22px;}
 H2 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:16px;}
 H3 {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;font-size:14px;}
 BODY {font-family:Tahoma,Arial,sans-serif;color:black;background-color:white;}
 B {font-family:Tahoma,Arial,sans-serif;color:white;background-color:#525D76;}
 P {font-family:Tahoma,Arial,sans-serif;background:white;color:black;font-size:12px;}
 A {color : black;}A.name {color : black;}HR {color : #525D76;}--></style>
 </head><body><h1>HTTP Status 500 - org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;

 java.lang.NoSuchMethodError: org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;
     at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:168)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:148)
     at org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:190)
     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:54)
     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:233)
     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1316)
     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:338)
     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:241)
     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:233)
     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:298)
     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:857)
     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:588)
     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:489)
     at java.lang.Thread.run(Thread.java:636)
 </h1><HR size="1" noshade="noshade"><p><b>type</b> Status report</p><p><b>message</b>
 <u>org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;

 For the above my url was...

  http://localhost:8080/solr/update/extract?literal.id=3922&defaultField=content&fmap.content=content&uprefix=ignored_&stream.contentType=application%2Fvnd.ms-powerpoint&commit=true&literal.title=Reactor+cycle+141&literal.notes=&literal.tag=UCN_production&literal.author=Maurits+van+der+Grinten

 I guess there's something special I need in order to process PowerPoint
 files? Maybe I need to get the latest Apache POI? Any suggestions
 welcome...


 Regards,

 Emyr

Re: Text Only Extraction Using Solr and Tika

2011-05-05 Thread Emyr James

Thanks for the suggestion, but surely there must be a better way to do
it than that?
I don't want to post the whole file up, get it extracted on the server,
send the extracted text back to the client, then send it all back up to
the server again as plain text.



Re: Text Only Extraction Using Solr and Tika

2011-05-05 Thread Anuj Kumar

Hi Emyr,

You can try the XPath-based approach and see if that works. Also, see if
dynamic fields can help you with the metadata fields.

References:
http://wiki.apache.org/solr/SchemaXml#Dynamic_fields
http://wiki.apache.org/solr/ExtractingRequestHandler#Input_Parameters
http://wiki.apache.org/solr/TikaExtractOnlyExampleOutput

Regards,
Anuj
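As a rough illustration of what the suggestion above might look like in practice, here is a sketch that builds an extract request combining `xpath` with `uprefix`. The XPath expression follows the style of the example on the ExtractingRequestHandler wiki page and is an assumption, as are the helper name and the parameter values. Note that `xpath` restricts which part of Tika's XHTML output is used, and the thread does not confirm that it keeps Tika's title metadata out of the document. The `ignored_` prefix only helps if the schema has a matching dynamic field (for example `ignored_*`) that accepts multiple values.

```python
from urllib.parse import urlencode, parse_qs

# Assumption: expression in the style of the ExtractingRequestHandler wiki example.
DIV_ONLY_XPATH = "/xhtml:html/xhtml:body/xhtml:div/descendant:node()"


def extract_query(doc_id, title, xpath=None, uprefix="ignored_"):
    """Build the query string for /update/extract (hypothetical helper)."""
    params = [
        ("literal.id", doc_id),        # explicit, single-valued id
        ("literal.title", title),      # explicit title supplied by the client
        ("fmap.content", "content"),   # map Tika's content to the 'content' field
        ("uprefix", uprefix),          # prefix for fields not defined in the schema
        ("commit", "true"),
    ]
    if xpath is not None:
        params.append(("xpath", xpath))  # limit which XHTML nodes are extracted
    return urlencode(params)


if __name__ == "__main__":
    query = extract_query("3922", "Reactor cycle 141", xpath=DIV_ONLY_XPATH)
    print("http://localhost:8080/solr/update/extract?" + query)
    print(parse_qs(query)["xpath"])
```

Building the query with `urlencode` also takes care of the `&` separators and the percent-encoding of the XPath expression.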


Re: Text Only Extraction Using Solr and Tika

2011-05-05 Thread Emyr James


Hi,
I'm not really sure how these can help with my problem. Can you give a
bit more info on this?


I think what I'm after is a fairly common request:

http://lucene.472066.n3.nabble.com/Controlling-Tika-s-metadata-td2378677.html
http://lucene.472066.n3.nabble.com/Select-tika-output-for-extract-only-td499059.html#a499062

Did the change that Yonik Seeley mentions, to allow more control over the
output, ever make it into 1.4?


Regards,
Emyr


Re: Text Only Extraction Using Solr and Tika

2011-05-05 Thread Ramirez, Paul M (388J)

Hey Emyr,

Looking at your stack trace below, my guess is that you have two conflicting
Apache POI jars on your classpath. The odd stack trace is indicative of that, as
the class loader is likely loading some other version of the DirectoryNode
class that doesn't have the iterator method.

 java.lang.NoSuchMethodError: 
 org.apache.poi.poifs.filesystem.DirectoryNode.iterator()Ljava/util/Iterator;

Thanks,
Paul Ramirez
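One way to check for the conflict described above is to scan the jar directories for every archive that ships the `DirectoryNode` class. The sketch below is a minimal example; the directories in the usage section are hypothetical and should be replaced with wherever your Tomcat and Solr installations actually keep their jars.

```python
import zipfile
from pathlib import Path

# Class named in the NoSuchMethodError, as a path inside a jar.
CLASS_ENTRY = "org/apache/poi/poifs/filesystem/DirectoryNode.class"


def jars_containing(class_entry, *roots):
    """Return every .jar under the given directories that contains class_entry."""
    hits = []
    for root in roots:
        for jar in sorted(Path(root).rglob("*.jar")):
            try:
                with zipfile.ZipFile(jar) as archive:
                    if class_entry in archive.namelist():
                        hits.append(jar)
            except zipfile.BadZipFile:
                # Skip files that are named .jar but are not valid archives.
                continue
    return hits


if __name__ == "__main__":
    # Hypothetical locations; adjust to your installation.
    for jar in jars_containing(CLASS_ENTRY, "/path/to/tomcat/lib", "/path/to/solr/lib"):
        print(jar)
```

If more than one jar is listed, the class loader may be picking up the older copy first, which would be consistent with the missing `iterator()` method.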


