Re: Indexing MS Powerpoint files with Lucene

Tomi NA Thu, 07 Sep 2006 05:22:28 -0700

On 9/7/06, Nick Burch <[EMAIL PROTECTED]> wrote:

> On Thu, 7 Sep 2006, Tomi NA wrote:
>> On 9/7/06, Venkateshprasanna <[EMAIL PROTECTED]> wrote:
>>> Is there any filter available for extracting text from MS Powerpoint files
>>> and indexing them?
>>> The lucene website suggests the POI project, which, it seems does not
>>> support PPT files as of now.
>>
>> http://jakarta.apache.org/poi/hslf/index.html
>>
>> It doesn't say poi doesn't support ppt. It just says support is limited.
>> Don't know exactly how limited, but certainly not useless for indexing
>> purposes.
>
> Support for editing and adding things to PowerPoint files is limited, as
> is getting out the finer points of fonts and positioning.


Which brings me to another (off)topic: can Lucene/Nutch assign
different weights to tokens in the same document field? An obvious
example would be: "this text seems to be in large, bold, blinking
letters: I'll assume it's more important than the surrounding 8px
text."
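
To make the question concrete, here is a minimal sketch of the idea in
plain Java. It is not Lucene or Nutch API: the StyledToken class, the 8px
baseline and the "font size divided by baseline" rule are all assumptions
made up for illustration. Each token contributes a weight derived from its
font size instead of a flat count of 1, so a term in large letters ends up
with a higher per-field score than the same term in small text.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class WeightedTokens {

    /** Hypothetical token type: the text plus the font size it was rendered in. */
    public static final class StyledToken {
        final String text;
        final double fontSizePx;

        public StyledToken(String text, double fontSizePx) {
            this.text = text;
            this.fontSizePx = fontSizePx;
        }
    }

    /** Assumed baseline: ordinary body text is 8px and gets weight 1.0. */
    static final double BASE_FONT_PX = 8.0;

    /**
     * Sums a per-token weight for every distinct (lower-cased) term.
     * A token's weight is fontSize / baseline, but never less than 1.0,
     * so small print still counts as one ordinary occurrence.
     */
    public static Map<String, Double> weightedTermFrequencies(List<StyledToken> tokens) {
        Map<String, Double> weights = new HashMap<>();
        for (StyledToken t : tokens) {
            double w = Math.max(1.0, t.fontSizePx / BASE_FONT_PX);
            weights.merge(t.text.toLowerCase(), w, Double::sum);
        }
        return weights;
    }

    public static void main(String[] args) {
        List<StyledToken> slide = Arrays.asList(
                new StyledToken("Lucene", 32.0),   // large title text
                new StyledToken("indexes", 8.0),
                new StyledToken("text", 8.0),
                new StyledToken("lucene", 8.0));   // same term in body text
        // "lucene" accumulates 4.0 (title) + 1.0 (body); the others get 1.0 each.
        System.out.println(weightedTermFrequencies(slide));
    }
}
```

Whether an indexer can store and score such per-token weights within a
single field is exactly the open question; the sketch only shows what the
weighting itself might look like.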

t.n.a.
