Re: Indexing Text File By Sections In Lucene

Prakash Dubey Thu, 04 Sep 2014 02:27:04 -0700

Hello Everyone,
I'm not sure this is the best solution for your problem, but the
following link, PDFBox Extracting Paragraphs
<http://stackoverflow.com/questions/9451312/pdfbox-extracting-paragraphs>,
might help.
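
Once the text has been extracted with its paragraph breaks intact, the
"index by sections" step Charlie describes in the quoted message could be
sketched roughly as below. This is only a minimal sketch under my own
assumptions: that sections are separated by one or more blank lines in the
extracted text, and that each section should carry the source file name and
a running index as metadata. The SectionSplitter and Section names are
hypothetical, not part of Lucene, Tika or PDFBox.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits extracted plain text into sections so that
// each section can later be indexed as its own document.
final class SectionSplitter {

    /** One section of the source file plus metadata to trace it back. */
    static final class Section {
        final String sourceFile; // file the section came from
        final int index;         // 0-based position of the section in the file
        final String text;       // section body, trimmed

        Section(String sourceFile, int index, String text) {
            this.sourceFile = sourceFile;
            this.index = index;
            this.text = text;
        }
    }

    /**
     * Splits text into sections.
     * Assumption: a section boundary is one or more blank lines
     * (lines that are empty or contain only spaces/tabs).
     */
    static List<Section> split(String sourceFile, String text) {
        List<Section> sections = new ArrayList<>();
        // Normalize Windows and old Mac line endings to '\n'.
        String normalized = text.replace("\r\n", "\n").replace('\r', '\n');
        for (String chunk : normalized.split("\\n(?:[ \\t]*\\n)+")) {
            String trimmed = chunk.trim();
            if (!trimmed.isEmpty()) {
                // sections.size() is the index of the section being added.
                sections.add(new Section(sourceFile, sections.size(), trimmed));
            }
        }
        return sections;
    }
}
```

Each Section returned by SectionSplitter.split("report.txt", extractedText)
would then become one document in the index, with sourceFile and index
stored as fields next to the text, so a hit can be traced back to the
original file. If your sections are marked by headings rather than blank
lines, the split pattern would need to change accordingly.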


Hope this helps!

Thanks and Regards
Prakash Kumar Dubey


On Thu, Sep 4, 2014 at 1:22 PM, Charlie Hull <[email protected]> wrote:

> On 04/09/2014 07:09, sunilragidi wrote:
>
>> Hi, I have a requirement in which I have to index a text file using
>> Lucene.
>>
>> The text file data is from a PDF file. I have used Tika to extract text
>> from the PDF and put it into the text file.
>>
>
> This may be your mistake - IIRC Tika isn't great at preserving structure
> within PDFs. We had a similar requirement a while ago to index large PDFs
> by paragraphs, and the paragraph markers were being lost. I suggest you
> look at other ways of extracting the plain text - pdftotext may preserve
> more of the structure, I think that's what we used. Once you have the
> individual sections you can index them as separate documents in Solr, with
> metadata to indicate the document they came from.
>
> HTH
>
> Charlie
>
>
>> I want to index the text file in the following way.
>>
>>      1. I don't want to index the whole text file content.
>>      2. I don't want to index sentence by sentence.
>>      3. Instead, I want to index the text file by sections. (The text
>>         file is huge.)
>>
>> How can I do this? Any help would be greatly appreciated.
>>
>> --Sunil
>>
>>
>>
>> --
>> View this message in context:
>> http://lucene.472066.n3.nabble.com/Indexing-Text-File-By-Sections-In-Lucene-tp4156843.html
>> Sent from the Lucene - General mailing list archive at Nabble.com.
>>
>>
>
> --
> Charlie Hull
> Flax - Open Source Enterprise Search
>
> tel/fax: +44 (0)8700 118334
> mobile:  +44 (0)7767 825828
> web: www.flax.co.uk
>
