Re: ExtractRequestHandler and Tika. Get only plain text

Sergio García Maroto Wed, 14 Nov 2018 06:46:56 -0800

Thanks a lot Jan.
That works very well.
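
For anyone finding this later, here is a minimal Python sketch of the extract-only request with the extractFormat=text parameter added. The helper names (build_extract_only_url, plain_text_from_response) are made up for illustration, wt=xml is an assumption so that the response has the XML shape quoted further down, and the parsing assumes the text comes back in the <str> element named after the stream, as it did for Test.txt.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Endpoint taken from the request in this thread; adjust host and core as needed.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/document/update/extract"


def build_extract_only_url(file_path, base_url=SOLR_EXTRACT_URL):
    """Build an extract-only request that asks for plain text (hypothetical helper)."""
    params = {
        "extractOnly": "true",
        "extractFormat": "text",  # the parameter suggested in the quoted reply
        "stream.file": file_path,
        "wt": "xml",  # assumption: ask for the XML response writer
    }
    # urlencode percent-encodes the colon and backslashes of the Windows path.
    return base_url + "?" + urlencode(params)


def plain_text_from_response(xml_body, stream_name):
    """Return the text of the top-level <str name="stream_name"> element, or None."""
    root = ET.fromstring(xml_body)
    for element in root.findall("str"):
        if element.get("name") == stream_name:
            return (element.text or "").strip()
    return None
```

Fetching the built URL (with urllib.request.urlopen, for example) and passing the body to plain_text_from_response(body, "Test.txt") should then give just the file text, without the metadata.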

I am now trying to index the document in Solr by removing the extractOnly
parameter, but I can't find any similar option to get the content indexed as
plain text. I am getting the metadata as well.
This is my request:
http://localhost:8983/solr/document/update/extract?literal.id=DDOC001&stream.file=C:\TIKA\FileTest\Test.txt&commit=true&fmap.content=DocContentS


My DocContentS field contains:
\n \n stream_size 13 \n X-Parsed-By org.apache.tika.parser.DefaultParser \n
X-Parsed-By org.apache.tika.parser.txt.TXTParser \n stream_name Test.txt
\n stream_source_info file:/C:/TIKA/FileTest/Test.txt \n Content-Encoding
ISO-8859-1 \n Content-Type text/plain; charset=ISO-8859-1 \n \n \n Prueba
Sergio \n "

I can't find out anywhere how to change this behaviour.




On Wed, 14 Nov 2018 at 13:06, Jan Høydahl <jan....@cominvent.com> wrote:

> Have you tried to specify &extractFormat=text?
>
> --
> Jan Høydahl, search solution architect
> Cominvent AS - www.cominvent.com
>
> > 14. nov. 2018 kl. 12:09 skrev marotosg <marot...@gmail.com>:
> >
> > Hi all,
> >
> > Currently I am trying to index documents of different kinds with Solr
> > and Tika. It's working fine, but when Solr returns the content of the
> > document, it doesn't return just the plain text. It comes back with some
> > metadata as well.
> >
> > For instance, this is my request:
> >
> > http://localhost:8983/solr/document/update/extract?extractOnly=true&stream.file=C:\TIKA\FileTest\Test.txt
> >
> > Content of Test.txt file is just "*Test File*".
> >
> > The response from Solr, as you can see below, returns plenty of information.
> > I would like the answer to be something like this, without noise for the
> > search:
> > <str name="Test.txt">
> > Test File
> > </str>
> >
> > <response>
> > <lst name="responseHeader">
> > <int name="status">0</int>
> > <int name="QTime">135</int>
> > </lst>
> > <str name="Test.txt">
> > <?xml version="1.0" encoding="UTF-8"?> <html
> > xmlns="http://www.w3.org/1999/xhtml"> <head> <meta name="stream_size"
> > content="13"/> <meta name="X-Parsed-By"
> > content="org.apache.tika.parser.DefaultParser"/> <meta name="X-Parsed-By"
> > content="org.apache.tika.parser.txt.TXTParser"/> <meta name="stream_name"
> > content="Test.txt"/> <meta name="stream_source_info"
> > content="file:/C:/TIKA/FileTest/Test.txt"/> <meta name="Content-Encoding"
> > content="ISO-8859-1"/> <meta name="Content-Type" content="text/plain;
> > charset=ISO-8859-1"/> <title></title> </head> <body> <p>Test File</p>
> > </body> </html>
> > </str>
> > <lst name="Test.txt_metadata">
> > <arr name="stream_size">
> > <str>13</str>
> > </arr>
> > <arr name="X-Parsed-By">
> > <str>org.apache.tika.parser.DefaultParser</str>
> > <str>org.apache.tika.parser.txt.TXTParser</str>
> > </arr>
> > <arr name="stream_name">
> > <str>Test.txt</str>
> > </arr>
> > <arr name="stream_source_info">
> > <str>file:/C:/TIKA/FileTest/Test.txt</str>
> > </arr>
> > <arr name="Content-Encoding">
> > <str>ISO-8859-1</str>
> > </arr>
> > <arr name="Content-Type">
> > <str>text/plain; charset=ISO-8859-1</str>
> > </arr>
> > </lst>
> > </response>
> >
> > Can anyone shed some light here?
> > Thanks a lot.
> >
> >
> >
> > --
> > Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
>
>
