Seconding Jürgen's comment. 4G docs are almost, but not quite, totally useless to search. How many JIRAs are in each? That's _one_ document unless you do some fancy dancing. Pulling the data directly using the JIRA API sounds far superior.
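For what it's worth, here is a rough sketch of what the API route could look like. It only builds the search URL and flattens one issue into a Solr-style document; the actual HTTP call and the Solr update are left out. It assumes the JIRA REST search resource at /rest/api/2/search with the jql, startAt, maxResults and fields parameters. The host name, the function names and the Solr field names (ticket_number, submitter, and so on) are made up for illustration and need to match your own setup and schema.

```python
from urllib.parse import urlencode


def build_search_url(base_url, updated_since, start_at=0, page_size=50):
    """Build a JIRA search URL for issues updated since a given timestamp.

    updated_since uses the JQL date format "yyyy/MM/dd HH:mm".
    Increase start_at by page_size to walk through the result pages.
    """
    params = {
        "jql": f'updated >= "{updated_since}" ORDER BY updated ASC',
        "startAt": start_at,
        "maxResults": page_size,
        "fields": "summary,description,status,reporter",
    }
    return f"{base_url.rstrip('/')}/rest/api/2/search?{urlencode(params)}"


def issue_to_solr_doc(issue):
    """Flatten one issue from the search response into one Solr document.

    The keys of the returned dict are hypothetical Solr field names.
    """
    fields = issue.get("fields") or {}
    status = fields.get("status") or {}
    reporter = fields.get("reporter") or {}
    return {
        "id": issue["key"],
        "ticket_number": issue["key"],
        "summary": fields.get("summary") or "",
        "description": fields.get("description") or "",
        "status": status.get("name", ""),
        "submitter": reporter.get("displayName", ""),
    }
```

Remembering the timestamp of the last run and passing it as updated_since gives you an incremental feed, and each ticket ends up as its own Solr document rather than being buried in a 4G PDF.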
If you _must_ use the JIRA->PDF->Solr option, consider the following: use Tika on the client to parse the doc, taking control of the mapping of the metadata and, probably, breaking things up into individual documents, one Solr document per JIRA. That'll give you a chance to deal with charset issues and the like.

Here's an example:
https://lucidworks.com/blog/indexing-with-solrj/

That one has both Tika and database connectivity but should be pretty straightforward to adapt; just pull the database junk out.

Best,
Erick

On Tue, Jan 6, 2015 at 9:55 AM, "Jürgen Wagner (DVT)" <juergen.wag...@devoteam.com> wrote:

> Hello,
> no matter which search platform you use, this will pose two
> challenges:
>
> - The size of the documents will render search less and less useful, as
> the likelihood of matches increases with document size. So, without
> proper semantic extraction (e.g., using decent NER or relationship
> extraction with a commercial text mining product), I doubt you will get
> the required precision to make this overly useful.
>
> - PDFs can have their own character sets based on the characters
> actually used. Such file-specific character sets are almost impossible
> to parse, i.e., if your PDFs happen to use this "feature" of the PDF
> format, you will have little luck getting any meaningful text out of
> them.
>
> My suggestion is to use the Jira REST API to collect all necessary
> documents and index the resulting XML or attachment formats. As the REST
> API provides filtering capabilities, you could easily create incremental
> feeds to avoid humongous indexing runs every time there's new
> information in Jira. Dumping Jira stuff as PDF seems to me to be the
> least suitable way of handling this.
>
> Best regards,
> --Jürgen
>
> On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote:
>
> Hello Solr-users and developers,
> Can you please suggest:
>
> 1. What should I do to index PDF content column-wise?
>
> 2. Do I need to extract the contents using some combination of
> Analyzer, Tokenizer and Filter and then add them to the index? How can I
> test the results on the command prompt? I do not know which specific
> Analyzer, Tokenizer and Filter to select for this purpose.
>
> 3. How can I verify that the needed column info is extracted out of the
> PDF and is indexed?
>
> 4. For example, how do I verify that the ticket number is extracted
> into the Ticket_number tag and is indexed?
>
> 5. Is it OK to post 4 GB worth of PDF to be imported and indexed by
> Solr? I think I saw some posts complaining about how large a file can
> be posted.
>
> 6. What will enable Solr to search in any PDF out of many, with
> different words such as "Runtime" "Error" "XXXX", so that the result
> provides the link to the PDF?
>
> My PDFs are nothing but exports of a Jira ticket system.
> Each PDF has info on
> Ticket Number:
> Desc:
> Client:
> Status:
> Submitter:
> And so on:
>
> 1. I imported a PDF document into Solr and it does the necessary
> searching, and I can test some of it using the browse client interface
> provided.
>
> 2. I have 80 GB worth of PDFs.
>
> 3. The total number of PDFs is about 200.
>
> 4. Many PDFs are 4 GB in size.
>
> 5. How do you suggest I import such large PDFs? What tools can you
> suggest to first extract the PDF contents into some XML format and later
> post that XML to be indexed by Solr?
>
> Your early response is much appreciated.
>
> Thanks
> G
>
> --
> Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С уважением
> i.A. Jürgen Wagner
> Head of Competence Center "Intelligence"
> & Senior Cloud Consultant
>
> Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
> Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
> E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
>
> ________________________________
> Managing Board: Jürgen Hatzipantelis (CEO)
> Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
> Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071