Walter,

Well said. (And I love the hamburger conversion analogy - very apt.)

The only thing I will add is that when you have a collection of similar rich text documents, you might be able to construct queries that respect internal structures within the documents. If all or most of your documents have a distinctive line such as "subject:", you might be able to be selective. Also, if your documents are organized on disk in some categorical way, you can include a reference to that categorical information in your query (via the id:*pattern* field). Finally, there *might* be useful information in the metadata that you can use to refine your searches.

Terry

On 07/11/2018 11:42 AM, Walter Underwood wrote:
> PDF is not a structured document format. It is a printer control format.
>
> PDF does not have a paragraph marker. Instead, it says to move
> to this spot on the page, choose this font, and print this letter. For a
> paragraph, it moves farther. For the next letter in a word, it moves a
> little bit. Extracting paragraphs from that is a difficult pattern recognition
> problem.
>
> I worked with a PDF of a two-column magazine article that printed
> the first line of column 1, then the first line of column 2, then the
> second line of column 1, and so on. If a line ended with a hyphenated
> word, too bad.
>
> Extracting structure from a PDF document is somewhere between
> very hard and impossible. Someone I worked with said that getting
> structured text from PDF was like turning hamburger back into a cow.
>
> Since Acrobat 5, there is “tagged PDF”. I’m not sure how widely that
> is used. It appears to be an accessibility feature, so it still might not
> be useful for search.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>> On Jul 11, 2018, at 8:07 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> Solr will not do this automatically, the Extracting Request Handler
>> simply indexes the entire contents of the doc without regard to things
>> like paragraphs etc. Ditto with HTML.
>> This is actually a task that
>> requires getting into Tika and using all the bells and whistles there.
>>
>> I'd recommend two things:
>>
>> 1> Take the PDF parsing offline, i.e. in a separate client. There are
>> many reasons for this, in particular you can attempt to do what you're
>> asking. See: https://lucidworks.com/2012/02/14/indexing-with-solrj/
>>
>> 2> Talk to the Tika folks about the best ways to make Tika return the
>> information such that you can index them and get what you'd like.
>>
>> Best,
>> Erick
>>
>> On Wed, Jul 11, 2018 at 6:35 AM, Rahul Prasad Dwivedi
>> <rdwiv...@bestpeers.com> wrote:
>>> Hello Team,
>>>
>>> I am using the Solr for indexing and searching for pdf document
>>>
>>> I have go through with your website document and installed solr but unable
>>> to index and search the document.
>>>
>>> For example: Suppose we have a PDF file which have no of paragraph with
>>> separate heading.
>>>
>>> So If I search for the title on indexed pdf the result should be contain
>>> the paragraph from where the title belongs.
>>>
>>> I am unable to perform this task.
>>>
>>> I have run the below command for upload the pdf
>>>
>>> bin/post -c gettingstarted pdf-sample.pdf
>>>
>>> and for searching I am running the command
>>>
>>> curl http://localhost:8983/solr/gettingstarted/select?q='*'
>>>
>>> Please suggest me anything and let me know if I am missing anything
>>>
>>> Thanks,
>>>
>>> Rahul
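
The query refinements suggested at the top of this message can be sketched roughly as follows. This is a minimal Python sketch under stated assumptions, not a tested recipe: it assumes Solr is running on localhost:8983 with the gettingstarted collection from the original question, and that the id field holds the file path (which is what the id:*pattern* suggestion relies on). The helper name build_select_url and the example search terms are made up for illustration. The code only builds the request URL; it does not contact Solr.

```python
from urllib.parse import urlencode

# Assumed for illustration: local Solr and the "gettingstarted"
# collection mentioned in the original question.
SOLR_SELECT = "http://localhost:8983/solr/gettingstarted/select"


def build_select_url(text, path_pattern=None, rows=10):
    """Build a /select URL for a phrase search (hypothetical helper).

    text         -- phrase to look for, e.g. a distinctive "subject:" line
    path_pattern -- optional wildcard applied to the id field, to narrow
                    the search to documents stored under a given category
    """
    # Wrap the text in double quotes so it is treated as a phrase and the
    # colon is not read as a field separator.
    phrase = '"{}"'.format(text.replace('"', '\\"'))
    params = [("q", phrase), ("rows", str(rows))]
    if path_pattern:
        # Filter query on id; assumes the id contains the file path.
        params.append(("fq", "id:" + path_pattern))
    return SOLR_SELECT + "?" + urlencode(params)


print(build_select_url("subject: quarterly report", path_pattern="*reports*"))
```

The printed URL can be passed to curl inside quotes. Note that characters such as / are special in Solr's standard query parser, so a pattern containing them would need backslash escaping; the sketch sidesteps that by using a pattern without slashes.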