Tomi NA wrote:
On 9/7/06, Nick Burch <[EMAIL PROTECTED]> wrote:
On Thu, 7 Sep 2006, Tomi NA wrote:
> On 9/7/06, Venkateshprasanna <[EMAIL PROTECTED]> wrote:
>> Is there any filter available for extracting text from MS Powerpoint files
>> and indexing them?
>> The lucene website suggests the POI project, which, it seems does not
>> support PPT files as of now.
>
> http://jakarta.apache.org/poi/hslf/index.html
>
> It doesn't say poi doesn't support ppt. It just says support is limited.
> Don't know exactly how limited, but certainly not useless for indexing
> purposes.

Support for editing and adding things to PowerPoint files is limited, as
is extracting the finer details of fonts and positioning.

Which brings me to another (off)topic: can lucene/nutch assign
different weights to tokens in the same document field? An obvious
example would be: "this text seems to be in large, bold, blinking
letters: I'll assume it's more important than the surrounding 8px
text."

No, it can't (at least not yet). As a workaround, you can extract these portions of text into a separate field (or several fields) and add them with a higher boost. Then expand your queries so that they also search this field. That way, if a query matches these special tokens, the document will rank higher because of the match on the boosted field.
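
To illustrate the query-expansion half of this workaround, here is a minimal sketch in plain Java. It only builds a query string in the classic Lucene query parser syntax (field:term^boost) and does not call Lucene itself. The field names "body" and "emphasis", the class and method names, and the choice to AND the user's terms together are all assumptions made for the example, not anything prescribed by Lucene or Nutch. Terms are not escaped, so input containing query-syntax characters would need extra handling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Sketch only: rewrites user terms so that each one is searched in both the
// regular field and a separate "emphasized text" field carrying a higher boost.
class BoostedFieldQuery {
    // Hypothetical field names; use whatever names your index actually has.
    static final String BODY_FIELD = "body";
    static final String EMPHASIS_FIELD = "emphasis";

    // Returns a query string in classic Lucene query parser syntax.
    // Each whitespace-separated term becomes (body:term OR emphasis:term^boost),
    // and the per-term groups are joined with AND (an assumption of this sketch).
    static String expand(String userTerms, float emphasisBoost) {
        String boost = String.format(Locale.ROOT, "%.1f", emphasisBoost);
        List<String> clauses = new ArrayList<>();
        for (String term : userTerms.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue; // blank input yields no clauses
            }
            clauses.add("(" + BODY_FIELD + ":" + term
                    + " OR " + EMPHASIS_FIELD + ":" + term + "^" + boost + ")");
        }
        return String.join(" AND ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(expand("powerpoint indexing", 4.0f));
    }
}
```

At indexing time, the text judged to be emphasized (large, bold, and so on) would go into the "emphasis" field in addition to, or instead of, the regular one. The resulting string can then be handed to the query parser in place of the user's original query.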

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


