Re: Indexing MS Powerpoint files with Lucene

Andrzej Bialecki Thu, 07 Sep 2006 06:29:09 -0700

Tomi NA wrote:

On 9/7/06, Nick Burch <[EMAIL PROTECTED]> wrote:

On Thu, 7 Sep 2006, Tomi NA wrote:
> On 9/7/06, Venkateshprasanna <[EMAIL PROTECTED]> wrote:

>> Is there any filter available for extracting text from MSPowerpoint files

>> and indexing them?
>> The lucene website suggests the POI project, which, it seems does not
>> support PPT files as of now.
>
> http://jakarta.apache.org/poi/hslf/index.html
>

> It doesn't say poi doesn't support ppt. It just says support islimited.

> Don't know exactly how limited, but certainly not useless for indexing
> purposes.


Support for editing and adding things to PowerPoint files is limited, as
is getting out the finer points of fonts and positioning.


Which brings me to another (off)topic: can lucene/nutch assign
different weights to tokens in the same document field? An obvious
example would be: "this text seems to be in large, bold, blinking
letters: I'll assume it's more important than the surrounding 8px
text."

No, it can't (at least not yet). As a workaround you can extract theseportions of text to another field (or multiple fields), and then addthem with a higher boost. Then, expand your queries so that they includealso this field. This way, if query matches these special tokens,results will get higher rank because of matching on this boosted field.


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Re: Indexing MS Powerpoint files with Lucene

Reply via email to