Currently, I am getting:

 "_text_" :
[" \n \n date 2019-06-24T09:52:33Z  \n cp:revision 5  \n Total-Time 1  \n 
extended-properties:AppVersion 15.0000  \n stream_content_type 
application/vnd.openxmlformats-officedocument.presentationml.presentation  \n 
meta:paragraph-count 18  \n meta:word-count 20  \n 
extended-properties:PresentationFormat Widescreen  \n dc:creator Khare, Kushal 
(MIND)  \n extended-properties:Company MIND  \n Word-Count 20  \n 
dcterms:created 2019-06-18T07:25:29Z  \n dcterms:modified 2019-06-24T09:52:33Z  
\n Last-Modified 2019-06-24T09:52:33Z  \n Last-Save-Date 2019-06-24T09:52:33Z  
\n Paragraph-Count 18  \n meta:save-date 2019-06-24T09:52:33Z  \n dc:title 
PowerPoint Presentation  \n Application-Name Microsoft Office PowerPoint  \n 
extended-properties:TotalTime 1  \n modified 2019-06-24T09:52:33Z  \n 
Content-Type 
application/vnd.openxmlformats-officedocument.presentationml.presentation  \n 
Slide-Count 2  \n stream_size 32234  \n X-Parsed-By 
org.apache.tika.parser.DefaultParser  \n X-Parsed-By 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser  \n creator Khare, Kushal 
(MIND)  \n meta:author Khare, Kushal (MIND)  \n meta:creation-date 
2019-06-18T07:25:29Z  \n extended-properties:Application Microsoft Office 
PowerPoint  \n meta:last-author Khare, Kushal (MIND)  \n meta:slide-count 2  \n 
Creation-Date 2019-06-18T07:25:29Z  \n xmpTPg:NPages 2  \n resourceName 
D:\\docs\\DemoOutput.pptx  \n Last-Author Khare, Kushal (MIND)  \n 
Revision-Number 5  \n Application-Version 15.0000  \n Author Khare, Kushal 
(MIND)  \n publisher MIND  \n Presentation-Format Widescreen  \n dc:publisher 
MIND  \n PowerPoint Presentation \n \n  slide-content   \n Hello. This is just 
for Demo!  \n If you find it anywhere, throw it away !\nA.W.A.Y away away away 
away away Away AWAY! \n  \n  \n A.W.A.Y once again !  \n  \n  \n  \n  \n  \n  
\n  \n  \n  \n  \n  \n  \n \n slide-master-content  \n slide-content   \n 
A.W.A.Y \n  \n away \n \n slide-master-content  \n embedded 
/docProps/thumbnail.jpeg    "],

What I want:

"_text_"  :
["\n  slide-content   \n Hello. This is just for Demo!  \n If you find it 
anywhere, throw it away !\nA.W.A.Y away away away away away Away AWAY! \n  \n  
\n A.W.A.Y once again !  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n  \n \n 
slide-master-content  \n slide-content   \n A.W.A.Y \n  \n away \n \n 
slide-master-content  \n embedded /docProps/thumbnail.jpeg    "],

"meta" : ["\n \n date 2019-06-24T09:52:33Z  \n cp:revision 5  \n Total-Time 1  
\n extended-properties:AppVersion 15.0000  \n stream_content_type 
application/vnd.openxmlformats-officedocument.presentationml.presentation  \n 
meta:paragraph-count 18  \n meta:word-count 20  \n 
extended-properties:PresentationFormat Widescreen  \n dc:creator Khare, Kushal 
(MIND)  \n extended-properties:Company MIND  \n Word-Count 20  \n 
dcterms:created 2019-06-18T07:25:29Z  \n dcterms:modified 2019-06-24T09:52:33Z  
\n Last-Modified 2019-06-24T09:52:33Z  \n Last-Save-Date 2019-06-24T09:52:33Z  
\n Paragraph-Count 18  \n meta:save-date 2019-06-24T09:52:33Z  \n dc:title 
PowerPoint Presentation  \n Application-Name Microsoft Office PowerPoint  \n 
extended-properties:TotalTime 1  \n modified 2019-06-24T09:52:33Z  \n 
Content-Type 
application/vnd.openxmlformats-officedocument.presentationml.presentation  \n 
Slide-Count 2  \n stream_size 32234  \n X-Parsed-By 
org.apache.tika.parser.DefaultParser  \n X-Parsed-By 
org.apache.tika.parser.microsoft.ooxml.OOXMLParser  \n creator Khare, Kushal 
(MIND)  \n meta:author Khare, Kushal (MIND)  \n meta:creation-date 
2019-06-18T07:25:29Z  \n extended-properties:Application Microsoft Office 
PowerPoint  \n meta:last-author Khare, Kushal (MIND)  \n meta:slide-count 2  \n 
Creation-Date 2019-06-18T07:25:29Z  \n xmpTPg:NPages 2  \n resourceName 
D:\\docs\\DemoOutput.pptx  \n Last-Author Khare, Kushal (MIND)  \n 
Revision-Number 5  \n Application-Version 15.0000  \n Author Khare, Kushal 
(MIND)  \n publisher MIND  \n Presentation-Format Widescreen  \n dc:publisher 
MIND  \n PowerPoint Presentation \n"]
-----Original Message-----
From: Shawn Heisey [mailto:apa...@elyograg.org]
Sent: 28 August 2019 14:18
To: solr-user@lucene.apache.org
Subject: Re: Require searching only for file content and not metadata

On 8/27/2019 7:18 AM, Khare, Kushal (MIND) wrote:
> Basically, the problem I am facing is this: I am getting the textual content 
> plus other metadata in my _text_ field, but I want only the textual content 
> written inside the document.
> I tried various Request Handler Update Extract configurations, but none of 
> them worked for me.
> Please help me resolve this, as I am badly stuck.

Controlling exactly what gets indexed in which fields is likely going to 
require that you write the indexing software yourself -- a program that 
extracts the data you want and sends it to Solr for indexing.

We do not recommend running the Extracting Request Handler in production -- 
Tika is known to crash when given some documents (usually PDF files are the 
problematic ones, but other formats can cause it too), and if it crashes while 
running inside Solr, it will take Solr down with it.

Here is an example program that uses Tika for rich document parsing.  It also 
talks to a database, but that part could be easily removed or modified:

https://lucidworks.com/post/indexing-with-solrj/
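
As a rough illustration of the "extract it yourself, then send it to Solr" 
approach, here is a minimal Python sketch (not the SolrJ program from the link 
above). It assumes the body text and the metadata have already been extracted 
separately, for example by running Tika outside Solr. The function names, the 
"meta" field (which would have to exist in the schema), and the update URL in 
the comment are all hypothetical:

```python
import json
import urllib.request


def build_solr_doc(doc_id, body_text, metadata):
    """Build one Solr document with body text and metadata in separate fields."""
    return {
        "id": doc_id,
        # Only the text written inside the document goes into _text_.
        "_text_": body_text.strip(),
        # "meta" is a hypothetical field name; it must be defined in the schema.
        # Each metadata entry becomes one "name value" string.
        "meta": [f"{name} {value}" for name, value in sorted(metadata.items())],
    }


def send_to_solr(update_url, docs):
    """POST a list of documents as JSON to a Solr update handler URL."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


if __name__ == "__main__":
    doc = build_solr_doc(
        "D:\\docs\\DemoOutput.pptx",
        "Hello. This is just for Demo!",
        {"dc:creator": "Khare, Kushal (MIND)", "Slide-Count": "2"},
    )
    print(json.dumps(doc, indent=2))
    # Assumed core name and port; uncomment to actually index the document:
    # send_to_solr("http://localhost:8983/solr/mycore/update?commit=true", [doc])
```

Because the client decides which string goes into which field, the metadata 
never reaches _text_ unless the schema copies it there.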

Thanks,
Shawn
