Hello everyone,

I'm not sure this is the best solution for your problem, but the following link on extracting paragraphs with PDFBox might help:

PDFBox Extracting Paragraphs <http://stackoverflow.com/questions/9451312/pdfbox-extracting-paragraphs>
Hope this helps!

Thanks and Regards,
Prakash Kumar Dubey

On Thu, Sep 4, 2014 at 1:22 PM, Charlie Hull <[email protected]> wrote:

> On 04/09/2014 07:09, sunilragidi wrote:
>
>> Hi, I have a requirement in which I have to index a text file using
>> Lucene.
>>
>> The text file data is from a PDF file. I have used Tika to extract text
>> from the PDF and put it into the text file.
>
> This may be your mistake - IIRC Tika isn't great at preserving structure
> within PDFs. We had a similar requirement a while ago to index large PDFs
> by paragraphs, and the paragraph markers were being lost. I suggest you
> look at other ways of extracting the plain text - pdftotext may preserve
> more of the structure, I think that's what we used. Once you have the
> individual sections you can index them as separate documents in Solr, with
> metadata to indicate the document they came from.
>
> HTH
>
> Charlie
>
>> I want to index the text file in the following way.
>>
>> 1. I don't want to index the whole text file content.
>> 2. I don't want to index sentence by sentence.
>> 3. Instead, I want to index the text file by sections. (The text file
>>    is huge.)
>>
>> How can I do this? Any help would be greatly appreciated.
>>
>> --Sunil
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Indexing-Text-File-By-Sections-In-Lucene-tp4156843.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile: +44 (0)7767 825828
> web: www.flax.co.uk
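
For what it's worth, the "one document per section" idea from the thread could be sketched roughly as below. This is only an illustration, not Lucene or Solr API: it assumes the extracted text separates sections with blank lines (which depends entirely on the extraction tool), and the names `split_into_sections`, `build_section_docs`, and the field names `id`, `source`, `section`, `content` are all made up for the example. Each resulting record would then be handed to whatever indexer you use.

```python
import re
from typing import Dict, List


def split_into_sections(text: str) -> List[str]:
    """Split plain text into sections separated by one or more blank lines.

    Assumption: the extractor kept blank lines between sections.
    """
    blocks = re.split(r"\n\s*\n", text)
    # Collapse internal line breaks and drop empty blocks.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def build_section_docs(source_name: str, text: str) -> List[Dict[str, object]]:
    """Create one record per section, tagged with the file it came from."""
    return [
        {
            "id": f"{source_name}#{number}",  # hypothetical id scheme
            "source": source_name,            # metadata: originating document
            "section": number,                # position within the document
            "content": section,
        }
        for number, section in enumerate(split_into_sections(text), start=1)
    ]


if __name__ == "__main__":
    sample = "Intro line one\nline two\n\n\nSecond section\n\n  \nThird"
    for doc in build_section_docs("report.txt", sample):
        print(doc)
```

If the sections in your file are marked some other way (headings, page breaks), only `split_into_sections` would need to change; the per-section records with a `source` field stay the same.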
