Still looking for an answer on schema.xml and solrconfig.xml:

1. Do I need to tell Solr that, to extract the title from a PDF, it should look for the word "Title", take the entire line after that tag, collect all such occurrences from hundreds of PDFs, and build and index the Title column data?
2. How do I define my own schema in Solr?
3. Say I have defined my fields: Title, Ticket_number, Submitter, Client, and so on. How can I verify that the respective data is extracted into the specific columns in Solr and indexed? Any suggestion on Analyzers, Tokenizers, and Filters, and which ones will help for this purpose? (A sketch follows below.)
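For questions 2 and 3, a minimal schema.xml fragment along these lines could be a starting point. The field names are taken from this thread; the analyzer chain (standard tokenizer, lower-casing, stop-word removal) is only a common default, not something tuned to your data:

  <!-- one stored, indexed field per ticket attribute; exact-value
       fields are plain strings so they match only as a whole -->
  <field name="ticket_number" type="string"      indexed="true" stored="true"/>
  <field name="title"         type="text_ticket" indexed="true" stored="true"/>
  <field name="submitter"     type="string"      indexed="true" stored="true"/>
  <field name="client"        type="string"      indexed="true" stored="true"/>

  <!-- analysis chain for free-text fields such as title -->
  <fieldType name="text_ticket" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    </analyzer>
  </fieldType>

The Analysis screen in the Solr admin UI then shows, per field type, how sample text is tokenized and filtered, which is the easiest way to test an analyzer chain before indexing anything.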
To clarify what I am trying to avoid:

1. I do not want to dump the entire contents of a 4 GB PDF into one searchable field (ATTR_CONTENT) in Solr.
2. Even if the entire PDF content is extracted into that field by default, I still want to extract the specific searchable column data into their respective fields.
3. Rather, I want to configure Solr to have column-wise searchable contents such as Title, Ticket_number, and so on (see the extract-handler sketch at the end of this message).

Any suggestions on performance? The PDF database is 80 GB; will it be fast enough? Do I need to divide it over multiple cores, multiple machines, multiple web apps? And clustering?

I should have mentioned that my PDFs come from a ticketing system like Jira, which was retired from production long ago; all I have left is the ticketing system's PDF database.

4. My system will be used internally by only a select few people.
5. They can wait for a 4 GB PDF to load.
6. I agree that many matches will be found in one large PDF, depending on the search criteria.
7. To make searches faster, I want Solr to create more columns and column-based indexes.
8. Solr underneath uses Tika, which extracts the contents and strips out all the rich-content formatting characters present in the PDF document.
9. I believe the resulting extraction is about 1/5th the size of the original PDF (just a rough guess based on one sample extraction).
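For points 1-3, the ExtractingRequestHandler (Solr Cell) at /update/extract lets you keep Tika's text dump out of ATTR_CONTENT and supply the column values yourself. A hedged sketch, assuming the field names above, a core named "tickets", and illustrative literal values; note that Solr Cell does not parse "Ticket Number:"-style labels out of the PDF body for you, so the literal.* parameters would have to be filled in by your own pre-parsing step:

  curl "http://localhost:8983/solr/tickets/update/extract?literal.ticket_number=JIRA-1234&literal.submitter=gyadav&fmap.content=body_text&uprefix=ignored_&commit=true" \
       -F "file=@ticket-JIRA-1234.pdf"

Here fmap.content=body_text redirects Tika's extracted text into a field of your choosing, and uprefix=ignored_ routes all unmapped Tika metadata to a throwaway dynamic field (assuming something like <dynamicField name="ignored_*" type="ignored" multiValued="true"/> in the schema). To verify afterwards that the data landed in the right columns, query the stored fields directly:

  curl "http://localhost:8983/solr/tickets/select?q=ticket_number:JIRA-1234&fl=ticket_number,title,submitter,client&wt=json"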
From: "Jürgen Wagner (DVT)" [mailto:juergen.wag...@devoteam.com]
Sent: Tuesday, January 06, 2015 11:56 AM
To: solr-user@lucene.apache.org
Subject: Re: PDF search functionality using Solr

Hello,

no matter which search platform you use, this will pose two challenges:

- The size of the documents will render search less and less useful, as the likelihood of matches increases with document size. So, without proper semantic extraction (e.g., using decent NER or relationship extraction with a commercial text-mining product), I doubt you will get the precision required to make this overly useful.

- PDFs can have their own character sets based on the characters actually used. Such file-specific character sets are almost impossible to parse, i.e., if your PDFs happen to use this "feature" of the PDF format, you won't be lucky enough to get any meaningful text out of them.

My suggestion is to use the Jira REST API to collect all the necessary documents and index the resulting XML or attachment formats. As the REST API provides filtering capabilities, you could easily create incremental feeds to avoid humongous re-indexing every time there is new information in Jira. Dumping Jira stuff as PDF seems to me the least suitable way of handling this.

Best regards,
--Jürgen

--
Kind regards
i.A. Jürgen Wagner
Head of Competence Center "Intelligence" & Senior Cloud Consultant

Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de

Managing Board: Jürgen Hatzipantelis (CEO)
Address of Record: 64331 Weiterstadt, Germany; Commercial Register: Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071

On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote:

Hello Solr users and developers,

Can you please suggest:

1. What should I do to index PDF content information column-wise?
2. Do I need to extract the contents using one of the Analyzer, Tokenizer, and Filter combinations and then add it to the index? How can I test the results at the command prompt? I do not know which specific Analyzer, Tokenizer, and Filter to select for this purpose.
3. How can I verify that the needed column info is extracted out of the PDF and indexed?
4. For example, how do I verify that the ticket number is extracted into the Ticket_number tag and indexed?
5. Is it OK to post 4 GB worth of PDF to be imported and indexed by Solr? I think I saw some posts complaining about how large a file can be posted.
6. What will enable Solr to search any PDF out of many, with different words such as "Runtime" "Error" "XXXX", and return a link to the matching PDF?

My PDFs are nothing but a Jira-style ticketing system. Each PDF has info on: Ticket Number, Desc, Client, Status, Submitter, and so on.

1. I imported a PDF document into Solr, and it does the necessary searching; I can test some of it using the browse client interface provided.
2. I have 80 GB worth of PDFs.
3. The total number of PDFs is about 200.
4. Many PDFs are 4 GB in size.
5. How do you suggest I import such large PDFs? What tools can you suggest to extract PDF contents first in some XML format and later post that XML to be indexed by Solr?

Your early response is much appreciated.

Thanks,
G
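On that last question (extracting the PDF contents to XML first and posting that), standalone Tika plus the stock post tool is one workable pipeline. A sketch, assuming tika-app 1.6 and a core named "tickets"; the transformation in step 2 is your own code, since Tika's XHTML output is not Solr's update format:

  # 1) extract the PDF to XHTML with standalone Tika
  java -jar tika-app-1.6.jar --xml huge-ticket-dump.pdf > huge-ticket-dump.xhtml

  # 2) transform the XHTML into Solr update XML with your own script,
  #    emitting one <doc> per "Ticket Number:" block:
  #    <add><doc><field name="ticket_number">...</field>...</doc></add>

  # 3) post the result with post.jar from example/exampledocs
  java -Durl=http://localhost:8983/solr/tickets/update -jar post.jar tickets.xml

Splitting each large PDF into one Solr document per ticket in step 2 would also sidestep posting 4 GB documents in one request, and would yield the column-wise fields directly.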