Re: PDF search functionality using Solr

Erick Erickson Tue, 06 Jan 2015 10:52:47 -0800

Seconding Jürgen's comment. 4G docs are almost, but not quite, totally
useless to search. How many JIRAs are in each? Each PDF is _one_ Solr
document unless you do some fancy dancing. Pulling the data directly
using the JIRA API sounds far superior.
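
For what it's worth, here is a minimal sketch of how an incremental pull from the JIRA REST API could be addressed. It only builds the request URL; it assumes the `/rest/api/2/search` resource with its `jql`, `startAt` and `maxResults` parameters is available on your JIRA version, and the host name and the `build_search_url` helper are made up for illustration.

```python
from urllib.parse import urlencode


def build_search_url(base_url, since, start_at=0, max_results=50):
    """Build a Jira search URL for issues updated at or after `since`.

    `since` is a JQL date string such as "2015/01/01 00:00".
    """
    # JQL filter: only issues touched since the last indexing run,
    # returned oldest first.
    jql = 'updated >= "%s" ORDER BY updated ASC' % since
    params = {"jql": jql, "startAt": start_at, "maxResults": max_results}
    return base_url.rstrip("/") + "/rest/api/2/search?" + urlencode(params)


# Hypothetical host; substitute your own Jira base URL.
print(build_search_url("https://jira.example.com", "2015/01/01 00:00"))
```

Remembering the timestamp of the last successful run and passing it as `since` would keep each feed small; check the parameter names against your own JIRA's REST documentation before relying on them.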


If you _must_ use the JIRA->PDF->Solr option, consider the following:
use Tika on the client to parse the doc, take control of the mapping
of the metadata and, probably, break things up into individual
documents, one Solr document per JIRA.

That'll give you a chance to deal with charset issues and the like.
Here's an example:

https://lucidworks.com/blog/indexing-with-solrj/

That one has both Tika and database connectivity but should be pretty
straightforward to adapt; just pull the database junk out.
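
The "one Solr document per JIRA" step might look roughly like the sketch below. It works on plain text that has already been extracted (by Tika or anything else) and does not call Tika or Solr itself. It assumes every ticket starts with a "Ticket Number:" line and uses the labels from the original post; the Solr field names and the `split_tickets` helper are hypothetical.

```python
import re

# Labels come from the original post; the field names on the right are
# hypothetical and must match whatever your Solr schema defines.
LABELS = {
    "Ticket Number": "ticket_number",
    "Desc": "description",
    "Client": "client",
    "Status": "status",
    "Submitter": "submitter",
}

LABEL_RE = re.compile(
    r"^(%s):\s*(.*)$" % "|".join(re.escape(label) for label in LABELS)
)


def split_tickets(text):
    """Split extracted plain text into one dict per ticket.

    Assumes each ticket begins with a "Ticket Number:" line and that
    unlabeled lines continue the previous field.
    """
    docs = []
    current = None  # dict for the ticket being read
    field = None    # name of the field last seen in that ticket
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LABEL_RE.match(line)
        if match:
            label, value = match.groups()
            if label == "Ticket Number":
                current = {}
                docs.append(current)
            if current is None:
                continue  # labeled text before the first ticket is ignored
            field = LABELS[label]
            current[field] = value
        elif current is not None and field is not None:
            # Continuation line: append to the field being read.
            current[field] = (current[field] + " " + line).strip()
    return docs
```

Each resulting dict could then be sent to Solr as its own document, which also gives you one place to inspect what actually landed in each field before indexing. Real PDF output will be messier than this, so treat the label matching as a starting point.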

Best,
Erick

On Tue, Jan 6, 2015 at 9:55 AM, "Jürgen Wagner (DVT)"
<juergen.wag...@devoteam.com> wrote:
> Hello,
>   no matter which search platform you will use, this will pose two
> challenges:
>
> - The size of the documents will render search less and less useful as the
> likelihood of matches increases with document size. So, without a proper
> semantic extraction (e.g., using decent NER or relationship extraction with
> a commercial text mining product), I doubt you will get the required
> precision to make this overly useful.
>
> - PDFs can have their own character sets based on the characters actually
> used. Such file-specific character sets are almost impossible to parse,
> i.e., if your PDFs happen to use this "feature" of the PDF format, you will
> have little luck getting any meaningful text out of them.
>
> My suggestion is to use the Jira REST API to collect all necessary documents
> and index the resulting XML or attachment formats. As the REST API provides
> filtering capabilities, you could easily create incremental feeds to avoid
> humongous indexing every time there's new information in Jira. Dumping Jira
> stuff as PDF seems to me to be the least suitable way of handling this.
>
> Best regards,
> --Jürgen
>
>
>
> On 06.01.2015 18:30, ganesh.ya...@sungard.com wrote:
>
> Hello Solr-users and developers,
> Can you please suggest,
>
> 1.       What I should do to index PDF content information column wise?
>
> 2.       Do I need to extract the contents using an Analyzer, Tokenizer
> and Filter combination and then add them to the index? How can I test the
> results from the command prompt? I do not know which specific Analyzer,
> Tokenizer and Filter to select for this purpose.
>
> 3.       How can I verify that the needed column info is extracted out of
> PDF and is indexed?
>
> 4.       So for example How to verify Ticket number is extracted in
> Ticket_number tag and is indexed?
>
> 5.       Is it ok to post 4 GB worth of PDF to be imported and indexed by
> Solr? I think I saw some posts complaining about limits on the size that
> can be posted.
>
> 6.       What will enable Solr to search across many PDFs for different
> words such as "Runtime", "Error" or "XXXX", with the result providing a
> link to the matching PDF?
>
> My PDFs are nothing but Jira ticket system.
> PDF has info on
> Ticket Number:
> Desc:
> Client:
> Status:
> Submitter:
> And so on:
>
>
> 1.       I imported PDF document in Solr and it does the necessary searching
> and I can test some of it using the browse client interface provided.
>
> 2.       I have 80 GB worth of PDFs.
>
> 3.       The total number of PDFs is about 200
>
> 4.       Many PDFs are of size 4 GB
>
> 5.       How do you suggest I import such large PDFs? What tools can
> you suggest to first extract the PDF contents into some XML format and
> later post that XML to be indexed by Solr?
>
>
>
>
>
>
>
> Your early response is much appreciated.
>
>
>
> Thanks
>
> G
>
>
>
>
> --
>
> Mit freundlichen Grüßen/Kind regards/Cordialement vôtre/Atentamente/С
> уважением
> i.A. Jürgen Wagner
> Head of Competence Center "Intelligence"
> & Senior Cloud Consultant
>
> Devoteam GmbH, Industriestr. 3, 70565 Stuttgart, Germany
> Phone: +49 6151 868-8725, Fax: +49 711 13353-53, Mobile: +49 171 864 1543
> E-Mail: juergen.wag...@devoteam.com, URL: www.devoteam.de
>
> ________________________________
> Managing Board: Jürgen Hatzipantelis (CEO)
> Address of Record: 64331 Weiterstadt, Germany; Commercial Register:
> Amtsgericht Darmstadt HRB 6450; Tax Number: DE 172 993 071
>
>
