Re: Indexing MS Powerpoint files with Lucene

Tomi NA Fri, 08 Sep 2006 00:56:59 -0700

On 9/7/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

Tomi NA wrote:
> On 9/7/06, Nick Burch <[EMAIL PROTECTED]> wrote:
>> On Thu, 7 Sep 2006, Tomi NA wrote:
>> > On 9/7/06, Venkateshprasanna <[EMAIL PROTECTED]> wrote:
>> >> Is there any filter available for extracting text from MS Powerpoint
>> >> files and indexing them?
>> >> The lucene website suggests the POI project, which, it seems does not
>> >> support PPT files as of now.
>> >
>> > http://jakarta.apache.org/poi/hslf/index.html
>> >
>> > It doesn't say poi doesn't support ppt. It just says support is
>> > limited.
>> > Don't know exactly how limited, but certainly not useless for indexing
>> > purposes.
>>
>> Support for editing and adding things to PowerPoint files is limited, as
>> is getting out the finer points of fonts and positioning.
>
> Which brings me to another (off)topic: can lucene/nutch assign
> different weights to tokens in the same document field? An obvious
> example would be: "this text seems to be in large, bold, blinking
> letters: I'll assume it's more important than the surrounding 8px
> text."


No, it can't (at least not yet). As a workaround you can extract these
portions of text to another field (or multiple fields), and then add
them with a higher boost. Then, expand your queries so that they also
include this field. This way, if a query matches these special tokens,
results will get a higher rank because of matching on this boosted field.
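
The workaround described above can be sketched in a few lines. This is a
toy scorer in plain Python, not Lucene or Nutch code: the field names
`body` and `emphasized`, the boost value 3.0, and the helpers `make_doc`
and `score` are all assumptions made for illustration. It only shows why a
match in the boosted field pushes a document up the ranking.

```python
from collections import Counter

# Hypothetical per-field boosts: a hit in emphasized text counts three times
# as much as a hit in ordinary body text.
FIELD_BOOSTS = {"body": 1.0, "emphasized": 3.0}


def make_doc(body, emphasized=""):
    """Keep ordinary and emphasized text in separate fields."""
    return {
        "body": body.lower().split(),
        "emphasized": emphasized.lower().split(),
    }


def score(doc, query):
    """Search every field (the 'expanded' query) and sum boosted term counts."""
    terms = query.lower().split()
    total = 0.0
    for field, boost in FIELD_BOOSTS.items():
        counts = Counter(doc[field])
        total += boost * sum(counts[term] for term in terms)
    return total


docs = {
    # "lucene" appears only in ordinary text
    "slide_a": make_doc("quarterly numbers and a note on lucene"),
    # "lucene" appears only in the large, bold text
    "slide_b": make_doc("quarterly numbers", emphasized="lucene"),
}

ranked = sorted(docs, key=lambda name: score(docs[name], "lucene"), reverse=True)
```

Here `ranked` puts `slide_b` first, because its only match is in the
boosted field. The parser still has to decide which text counts as
emphasized; the sketch assumes that split has already been made.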


I thought a workaround like that would be needed. Still, it could give
useful results...though as a nutch user, the possibility is mostly
theoretical for me, as probably none of the existing parsers take into
account the formatting information. I could be completely wrong here,
so please, feel free to correct me.

t.n.a.
