Re: Regarding pdf indexing issue

Terry Steichen Wed, 11 Jul 2018 09:00:11 -0700

Walter,

Well said.  (And I love the hamburger conversion analogy - very apt.)


The only thing I will add is that when you have a collection of similar
rich text documents, you might be able to construct queries that respect
internal structures within the documents.  If all/most of your documents
have a distinctive line like "subject:", you might be able to search
selectively on the text that follows it.

Also, if your documents are organized on disk in some categorical way,
you can include a reference to that categorical information in your
query (via a wildcard pattern on the id field, i.e. id:*pattern*).
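
As a rough sketch of the two ideas above, a query could combine a phrase
search on a "subject:" line with a filter on the document path stored in
id. The helper name build_select_url, the catch-all field _text_, the
phrase text, and the path fragment "reports" are all assumptions made for
illustration; substitute whatever your own schema and directory layout use.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, filters=(), rows=10):
    """Build a Solr /select URL with properly escaped parameters.

    Hypothetical helper: it only assembles the URL and does not contact Solr.
    """
    params = [("q", query), ("rows", str(rows))]
    # Each filter query (fq) narrows the result set without affecting scoring.
    params.extend(("fq", f) for f in filters)
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = build_select_url(
    "http://localhost:8983/solr",
    "gettingstarted",
    # Assumes extracted text lands in a catch-all field named _text_.
    '_text_:"subject: quarterly report"',
    # Assumes id holds the file path, so a wildcard can select a directory.
    filters=["id:*reports*"],
)
print(url)
```

Encoding the parameters this way also avoids the shell quoting problems
that come up when the query is typed straight into a curl command line.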

Finally, there *might* be useful information in the metadata that you
can use in refining your searches.
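
One way to see what metadata is available is to look at the field names on
a document that Solr returns. The sketch below works on a hand-written
sample shaped like one entry of a JSON response; the sample values, the
helper name metadata_fields, and the list of fields to skip are
assumptions, not output from a real index.

```python
def metadata_fields(doc, skip=("id", "_text_", "_version_")):
    """Return the sorted field names of a result doc, minus the skipped fields."""
    return sorted(name for name in doc if name not in skip)


# Hypothetical document, shaped like one entry of response.docs.
sample_doc = {
    "id": "/data/reports/2018/q2.pdf",
    "author": ["J. Smith"],
    "title": ["Q2 Report"],
    "_version_": 1605700000000000000,
}
print(metadata_fields(sample_doc))
```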

Terry


On 07/11/2018 11:42 AM, Walter Underwood wrote:
> PDF is not a structured document format. It is a printer control format.
>
> PDF does not have a paragraph marker. Instead, it says to move
> to this spot on the page, choose this font, and print this letter. For a
> paragraph, it moves farther. For the next letter in a word, it moves a 
> little bit. Extracting paragraphs from that is a difficult pattern recognition
> problem.
>
> I worked with a PDF of a two-column magazine article that printed
> the first line of column 1, then the first line of column 2, then the 
> second line of column 1, and so on. If a line ended with a hyphenated
> word, too bad.
>
> Extracting structure from a PDF document is somewhere between 
> very hard and impossible. Someone I worked with said that getting
> structured text from PDF was like turning hamburger back into a cow.
>
> Since Acrobat 5, there is “tagged PDF”. I’m not sure how widely that
> is used. It appears to be an accessibility feature, so it still might not
> be useful for search.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>> On Jul 11, 2018, at 8:07 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> Solr will not do this automatically. The Extracting Request Handler
>> simply indexes the entire contents of the doc without regard to things
>> like paragraphs etc. Ditto with HTML. This is actually a task that
>> requires getting into Tika and using all the bells and whistles there.
>>
>> I'd recommend two things:
>>
>> 1> Take the PDF parsing offline, i.e. in a separate client. There are
>> many reasons for this, in particular you can attempt to do what you're
>> asking. See: https://lucidworks.com/2012/02/14/indexing-with-solrj/
>>
>> 2> Talk to the Tika folks about the best ways to make Tika return the
>> information in a form that you can index and get what you'd like.
>>
>> Best,
>> Erick
>>
>> On Wed, Jul 11, 2018 at 6:35 AM, Rahul Prasad Dwivedi
>> <rdwiv...@bestpeers.com> wrote:
>>> Hello Team,
>>>
>>> I am using Solr to index and search PDF documents.
>>>
>>> I have gone through the documentation on your website and installed Solr,
>>> but I am unable to index and search the document the way I need.
>>>
>>> For example: suppose we have a PDF file which has a number of paragraphs,
>>> each with a separate heading.
>>>
>>> If I search for a title in the indexed PDF, the result should contain
>>> the paragraph to which the title belongs.
>>>
>>> I am unable to perform this task.
>>>
>>> I ran the command below to upload the PDF:
>>>
>>> *bin/post -c gettingstarted pdf-sample.pdf*
>>>
>>> and for searching I am running the command
>>>
>>> *curl http://localhost:8983/solr/gettingstarted/select?q='*'*
>>>
>>> Please suggest anything that might help, and let me know if I am missing something.
>>>
>>> Thanks,
>>>
>>> Rahul
>
